Abstract


SIF (Servlet Information Flow) is a novel software 
framework for building high-assurance web applications, 
using language-based information-flow control to en- 
force security. Explicit, end-to-end confidentiality and 
integrity policies can be given either as compile-time 
program annotations, or as run-time user requirements. 
Compile-time and run-time checking efficiently enforce 
these policies. Information flow analysis is known to 
be useful against SQL injection and cross-site scripting, 
but SIF prevents inappropriate use of information more 
generally: the flow of confidential information to clients 
is controlled, as is the flow of low-integrity information 
from clients. Expressive policies allow users and appli- 
cation providers to protect information from one another. 

SIF moves trust out of the web application, and into 
the framework and compiler. This provides application 
deployers with stronger security assurance. 

Language-based information flow promises cheap, 
strong information security. But until now, it could not 
effectively enforce information security in highly dy- 
namic applications. To build SIF, we developed new lan- 
guage features that make it possible to write realistic web 
applications. Increased assurance is obtained with mod- 
est enforcement overhead. 


1 Introduction 


Web applications are now used for a wide range of 
important activities: email, social networking, on-line 
shopping and auctions, financial management, and many 
more. They provide services to millions of users and 
store information about and for them. However, a 
web application may contain design or implementation 
vulnerabilities that compromise the confidentiality, in- 
tegrity, or availability of information manipulated by the 
application, with financial, legal, or ethical implications. 
According to a recent report [33], web applications ac- 
count for 69% of Internet vulnerabilities. Current tech- 
niques appear inadequate to prevent vulnerabilities in 
web applications. 

In general, information security vulnerabilities arise 
from inappropriate information dependencies, so tracking
information flows within applications offers a comprehensive
solution. Confidentiality can be enforced
by controlling information flow from sensitive data to
clients; integrity can be enforced by controlling infor- 
mation flow from clients to trusted information—as a 
side effect, protecting against common vulnerabilities 
like SQL injection and cross-site scripting. In fact, recent 
work [14, 19, 37, 15] on static analysis of PHP and Java 
web applications has used dependency analyses to find 
many vulnerabilities in existing web applications and 
web application libraries. Dynamic tainting can detect 
some improper dependencies and has also proved use- 
ful in detecting vulnerabilities [39, 6]. However, static 
analyses have the advantage that they can conservatively 
identify information flows, providing stronger security 
assurance [28]. 

Therefore, we have developed Servlet Information 
Flow (SIF), a novel framework for building web appli- 
cations that respect explicit confidentiality and integrity 
information security policies. SIF web applications are 
written in Jif 3.0, an extended version of the Jif program- 
ming language [21, 24] (which itself extends Java with 
information-flow control). The enforcement mechanisms 
of SIF and Jif 3.0 track the flow of information within 
a web application, and information sent to and returned 
from the client. SIF reduces the trust that must be placed 
in web applications, in exchange for trust in the servlet 
framework and the Jif 3.0 compiler—a good bargain be- 
cause the framework and compiler are shared by all SIF 
applications. 

The security policies used in SIF are both strong and 
expressive. Information flow is tracked through a type 
system that tracks all information flows, not merely ex- 
plicit flows. Security enforcement is end-to-end, because 
policies are enforced on information from when it en- 
ters the web application, to when it leaves, even as in- 
formation flows between different client requests. The 
security policies are expressive, allowing complex secu- 
rity requirements of multi-user systems to be enforced. 
Unlike prior frameworks for tracking information flow 
in web applications, policies can express fine-grained re- 
quirements for both confidentiality and integrity. Further, 
the interactions between confidentiality and integrity are 
controlled. 

The end-to-end security provided by information-flow 
control has long been appealing, but much theoretical 
work on language-based information flow has not yet 
been successfully put into practice. We have identified
limitations of existing security-typed languages for rea- 
soning about security in a dynamic external environment, 
and we have extended the Jif language with new features 
supporting these dynamic environments, resulting in a 
new version of the language, Jif 3.0. 

Information-flow control mechanisms work by label- 
ing information. In previous information flow mecha- 
nisms, the space of labels is essentially static. In ear- 
lier versions of Jif, for example, labels are expressed in 
terms of principals, but the set of principals is fixed at 
compile time. This is a serious limitation for web appli- 
cations, which often add new users at run time. Jif 3.0 
adds the ability for applications to create their own prin- 
cipals, dynamically extending the space of information 
labels. Moreover, Jif 3.0 allows applications to imple- 
ment their own authentication and authorization mecha- 
nisms for these application-specific principals—a neces- 
sity given the diversity of authentication schemes needed 
by different applications. Jif 3.0 also improves Jif’s abil- 
ity to reason about dynamic security policies, allowing, 
for example, web application users to specify their own 
security requirements at run time and have them enforced 
by the information flow mechanisms. These new mecha- 
nisms create new information channels, but Jif 3.0 tracks 
these channels and prevents their misuse. 

To explore the performance and usability of SIF, we 
developed two web applications with non-trivial security 
requirements: an email application specialized for cross- 
domain communication, and a multiuser shared calendar. 
Both applications add new principals and policies at run 
time, and both allow users to define their own informa- 
tion security policies, which are enforced by the same 
mechanisms used for compile-time policies. 

In summary, this paper makes three significant contri- 
butions: 


• It shows how to use language-based information
flow to construct a practical framework for high-assurance
web applications, in which information
flow is tracked to and from clients, and users can
specify and reason about information security. To
our knowledge, this is the first implemented web application
framework to strongly enforce both confidentiality
and integrity.

• It shows that application-defined mechanisms for
access control and authentication, and a dynamically
extensible space of labels, can be integrated
securely with language-based information flow.

• It describes the experience using these new mechanisms
to build realistic web applications.


The remainder of the paper is structured as follows. 
Section 2 gives an overview of the Servlet Information
Flow framework, including some background on Jif.
Section 3 introduces the new dynamic features in Jif 3.0,
which enhance Jif’s ability to express and enforce dy- 
namic security requirements. Our experience with build- 
ing web applications in SIF is described in Section 4. 
Section 5 covers related work, and Section 6 concludes. 


2 Servlet Information Flow framework 


SIF is built using the Java Servlet framework [7], but 
presents a higher-level interface to web applications. 
Through a combination of static and dynamic mecha- 
nisms, SIF ensures that web applications use data only 
in accordance with specified security policies, by track- 
ing the flow of information in the server, and informa- 
tion sent to and from the client. Web applications in 
SIF are written entirely in Jif 3.0, an extended version of 
the security-typed language [36] Jif, in which types are 
annotated with information flow policies. Security poli- 
cies are enforced on information as it flows through the 
system, giving stronger security assurance than ordinary 
(discretionary) access control. 

In designing SIF, we faced two main challenges. The 
first was identifying information flows in web applica- 
tions, including information that flows over multiple re- 
quests. For example, a request sent to a server by a 
user may contain information about the user’s previous 
request and response. The second challenge was to re- 
strict insecure information flows while providing suffi- 
cient flexibility to implement full-fledged web applica- 
tions. The resulting framework is a principled approach 
to designing realistic, secure web applications. 

SIF is implemented in about 4040 non-comment, non- 
blank lines of Java code. An additional 960 lines of Jif 
code provide signatures for the Java classes that web ap- 
plications interact with. Jif signatures provide security 
annotations for Java classes, and expose only a subset of 
the actual methods and fields to clients. SIF web appli- 
cations are compiled against the Jif signatures, but linked 
at run time against the Java classes. Some Java Servlet 
framework functionality makes reasoning about infor- 
mation security infeasible. Using signatures and wrap- 
per classes, SIF necessarily limits access to this func- 
tionality, but without preventing implementation of full- 
fledged web applications. 

In this section, we first describe the threat model that 
SIF addresses, and the security assurances that SIF pro- 
vides. We present some background about Jif before de- 
scribing the design of SIF. 


2.1 Threat model and security assurance 


Threat model. We assume that web application clients
are potentially malicious, and that web application implementations
are benign but possibly buggy. Thus, we
aim to ensure that appropriate confidentiality and integrity
security policies are enforced on server-side information
regardless of the actions of clients, or the mistakes
of well-meaning application programmers.

Although the Jif programming language prevents the 
unintentional violation of information security, it pro- 
vides mechanisms for explicit intentional downgrading 
of security policies (see Section 4.3). While a well- 
meaning programmer will be unable to accidentally mis- 
use these mechanisms, a malicious programmer may be 
able to subvert them, or use certain covert channels that 
Jif does not track (see Section 2.2). 

We do not address network threats, such as denial of 
service attacks, or the interception and alteration of data 
sent over the network. 

The Jif compiler and SIF are added to the trusted com- 
puting base, which already includes the servlet container, 
and the software stack required to run the servlet con- 
tainer. Note that SIF web applications are not part of 
the trusted computing base, whereas in standard servlet 
frameworks, web applications must be trusted. 


Security assurance. In a typical web application, secu- 
rity assurance consists of convincing each party with a 
stake in the system that the application enforces their se- 
curity requirements. Obviously users would like to have 
assurance that information they input will be confiden- 
tial, and information they view is not corrupted. The ap- 
plication provider (i.e., deployer) may also have confi- 
dentiality and integrity requirements for its information. 
Like other recent work on improving security of web ap- 
plications (e.g., [14, 18, 37, 15]), we focus on providing 
assurance to deployers. The difference here is that SIF 
enforces rich policies for information integrity and con- 
fidentiality, including policies provided by the user. 

Although we focus on providing assurance to deploy- 
ers, it is worth considering security assurance from a 
web application user’s perspective. Users must be con- 
vinced that they are communicating with an application 
that enforces their security requirements. The security 
validation offered by SIF effectively partitions the secu- 
rity assurance problem into two parts: first, ensuring that 
the application respects users’ security requirements, and 
second, ensuring the server users communicate with is 
correctly running the application. 

SIF addresses the first part of the assurance problem: 
verifying the security properties of web application code. 
SIF does not address the second part: convincing a re- 
mote client they are communicating with verified code. 
This step is important if the web application provider 
might be malicious. However, remote attestation meth- 
ods [34, 10, 30] seem likely to be effective in solving 
this second problem. Attestation methods could be used 
to sign application code, or alternatively, to sign a veri- 
fication certificate from a trusted SIF compiler that has 
checked the code. We leave integration of attestation
mechanisms to future work.

In any case, concern about malicious application 
providers should not be exaggerated; users’ willingness 
to spend money via web applications suggests they al- 
ready place a modicum of trust in them. This work aims 
to ensure this trust is justified. At a minimum, this means 
application deployers can be more confident in making 
possibly legally binding representations to their users. 

The SIF framework provides the following security as- 
surances to deployers of web applications. 


• SIF applications enforce explicit information security
policies. In particular, SIF ensures that information
sent to the client is permitted to be read by
the client, thus ensuring that confidential information
held on the server is not inadvertently released
to the client. Further, information received from
the client is marked as tainted by the client, helping
prevent inappropriate use of low-integrity information.
Thus, useful confidentiality and integrity
restrictions are enforced in SIF applications.

• The information security policies of back-end systems
(e.g., a database, file system, or legacy application)
are also enforced, provided these systems have
appropriate interfaces annotated with Jif 3.0 security
policies. Thus, adding a web front-end to an
existing system does not weaken the security assurance
of that system, modulo the assumptions of our
threat model.

• Jif ensures that security policies on information
are not unintentionally weakened, or downgraded.
However, many web applications that handle sensitive
information intentionally downgrade information
as part of their functionality. As discussed further
in Section 4.3, SIF web applications must satisfy
rules that enforce selective downgrading [22,
26] and robustness against all attackers [5], security
conditions that provide strong information flow
guarantees in the presence of downgrading.

• SIF web applications can produce only well-formed
HTML. While cascading style sheets and JavaScript
may be used, they cannot be dynamically generated,
and must be explicitly specified in the deployment
descriptor, where they can be more easily reviewed
by the application deployer. The deployer thereby
gains assurance that a web application does not contain
malicious client-side code.


2.2 Background on Jif 


SIF web applications are written in Jif 3.0, a new ver- 
sion of the Jif programming language. To understand the 
design of SIF, some background on the Jif programming 
language is helpful. Readers familiar with Jif may skip 
this subsection. Details of some of the new features of 
Jif 3.0 are given in Section 3.

Jif is a security-typed language [36]: a type has a se- 
curity label L that describes restrictions on information 
at that type, which the compiler enforces. Security-type 
systems like that in Jif can enforce noninterference, en- 
suring that information labeled L can depend only on in- 
formation labeled L or with a less restrictive label [28]. 
In other words, information cannot leak from higher to 
lower levels, nor can untrusted information affect trusted 
information. Proofs for noninterference exist for numer- 
ous security-typed languages, but not for any language as 
expressive as Jif. Jif labels are based on policies from the 
decentralized label model (DLM) [22], in which princi- 
pals express ownership of information-flow policies. 

A principal is an entity with security concerns, and 
the power to observe and change certain aspects of the 
system. In a web application, principals may be users 
of the application, user groups, or even the web appli- 
cation itself; SIF applications may choose which entities 
to model as principals. Web application principals may 
have different security concerns, and do not necessarily 
trust each other. By allowing principals to have different 
security policies, the DLM can express security concerns 
of mutually distrusting principals. 

A principal p may delegate to another principal q, in
which case q is said to act for p. The acts-for relation
is reflexive and transitive, and is similar to the speaks-for
relation [16]. The acts-for relation is needed to express
trust relationships between principals, and can encode
groups and roles. Jif supports a top principal ⊤ able
to act for all principals, and a bottom principal that allows
all principals to act for it. A principal may also grant
its authority to code, meaning the code is trusted to perform
actions such as declassification that could violate
the principal’s information security.
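The acts-for relation described above can be modeled as reachability over delegation edges. The following Java sketch is only an illustration under stated assumptions: the class ActsForGraph, its methods, and the names TOP and BOTTOM are hypothetical and are not part of Jif or SIF, and principals are represented as plain strings.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the acts-for relation; not Jif's actual run-time API. */
final class ActsForGraph {
    static final String TOP = "TOP";       // hypothetical name for the top principal
    static final String BOTTOM = "BOTTOM"; // hypothetical name for the bottom principal

    // delegations.get(p) holds every q that p has delegated to directly.
    private final Map<String, Set<String>> delegations = new HashMap<>();

    /** Records that p delegates to q, so that q acts for p. */
    void delegate(String p, String q) {
        delegations.computeIfAbsent(p, k -> new HashSet<>()).add(q);
    }

    /** Returns true if q acts for p under the reflexive, transitive closure. */
    boolean actsFor(String q, String p) {
        // Reflexivity, plus the special roles of the top and bottom principals.
        if (q.equals(p) || q.equals(TOP) || p.equals(BOTTOM)) {
            return true;
        }
        // Transitivity: search the delegation edges reachable from p.
        Deque<String> work = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        work.push(p);
        seen.add(p);
        while (!work.isEmpty()) {
            String current = work.pop();
            Set<String> direct =
                delegations.getOrDefault(current, Collections.<String>emptySet());
            for (String delegate : direct) {
                if (delegate.equals(q)) {
                    return true;
                }
                if (seen.add(delegate)) {
                    work.push(delegate);
                }
            }
        }
        return false;
    }
}
```

For instance, if a group principal delegates to a user and the user delegates to a session principal, the session principal acts for the group, but the group does not act for the user.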

Jif labels are constructed from reader policies and
writer policies [5]. A reader policy o → r1, ..., rn means
that principal o owns the policy, and o permits any principal
that can act for any ri (or o itself) to read the data. For
example, the reader policy ⊤ → p says that the top principal
permits p to observe information. A writer policy
o ← w1, ..., wn is owned by principal o, and o has permitted
any principal that can act for any of w1, ..., wn,
or o, to have influenced (“written”) the data.

Reader policies restrict to which principals informa- 
tion may flow, whereas writer policies describe from 
which principals information may have flowed. Reader 
policies thus describe confidentiality, and writer policies 
describe integrity (provenance) of information. 

A Jif label is a pair of a confidentiality policy and an integrity
policy, written {c ; d} for confidentiality policy c
and integrity policy d. The set of confidentiality policies
is formed by closing reader policies under conjunction
and disjunction, denoted ⊔ and ⊓ respectively. The conjunction
of two confidentiality policies, c1 ⊔ c2, enforces
the restrictions of both c1 and c2. Thus, the readers permitted
by c1 ⊔ c2 are the intersection of the readers permitted
by c1 and c2. Similarly, the readers permitted by the disjunction
c1 ⊓ c2 are the union of the readers permitted by c1
and c2. Integrity policies are formed by closing writer
policies under conjunction and disjunction. Dually to
confidentiality, conjunction and disjunction are respectively
denoted ⊓ and ⊔.

For example, in the label {Alice → Bob ⊔ Chuck →
Bob, Dave ; Alice ← ⊤}, the confidentiality policy
is the join of two reader policies, Alice → Bob and
Chuck → Bob, Dave. Thus, information with this label
can be read only by principals that can act for at
least one of Alice or Bob, and at least one of Chuck,
Bob, or Dave; clearly, Bob is one such principal. The
integrity policy of the label consists of a single writer
policy, owned by Alice, stating that Alice believes the
data has been influenced only by principals able to act
for Alice or the top principal ⊤. SIF uses confidentiality
policies to restrict what information is sent to the client,
and integrity policies to restrict how information received
from the client is used.
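The reading of the example label's confidentiality policy can be checked mechanically. The Java sketch below is a simplified illustration, not Jif code: the classes ReaderPolicy and Confidentiality are hypothetical, and it assumes a trivial principal hierarchy in which each principal acts only for itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** A reader policy "owner -> readers" (illustrative type, not part of Jif or SIF). */
final class ReaderPolicy {
    final String owner;
    final Set<String> readers;

    ReaderPolicy(String owner, String... readers) {
        this.owner = owner;
        this.readers = new HashSet<>(Arrays.asList(readers));
    }

    /** Assumption: a principal acts only for itself, so set membership suffices. */
    boolean permits(String principal) {
        return principal.equals(owner) || readers.contains(principal);
    }
}

final class Confidentiality {
    /** A conjunction (join) of reader policies permits a reader only if every policy does. */
    static boolean mayRead(List<ReaderPolicy> conjunction, String principal) {
        for (ReaderPolicy policy : conjunction) {
            if (!policy.permits(principal)) {
                return false;
            }
        }
        return true;
    }
}
```

With the two reader policies of the example (owner Alice with reader Bob, and owner Chuck with readers Bob and Dave), only Bob satisfies both policies under this simplified hierarchy; Alice and Dave each satisfy just one.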





Secure information flow requires that the label on a
piece of information can only become more restrictive
as the information flows through the system. Given labels
L and L′, we write L ⊑ L′ if the label L′ restricts
the use of information at least as much as L does. To
handle computations that combine information from different
sources, the label L1 ⊔ L2 imposes the restrictions
of both L1 and L2.
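The ordering and join on labels can be illustrated with a deliberately simplified model that identifies a label with the set of principals permitted to read the data and the set of principals that may have influenced it. This is an assumption made for illustration only: the decentralized label model also accounts for policy owners and the acts-for hierarchy, and the class SimpleLabel below is hypothetical rather than part of Jif.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Simplified label: a set of permitted readers and a set of possible writers. */
final class SimpleLabel {
    final Set<String> readers; // principals permitted to read the data
    final Set<String> writers; // principals that may have influenced the data

    SimpleLabel(Set<String> readers, Set<String> writers) {
        this.readers = new HashSet<>(readers);
        this.writers = new HashSet<>(writers);
    }

    /** Convenience helper for building principal sets. */
    static Set<String> setOf(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /**
     * Returns true if this label is no more restrictive than other:
     * other permits no additional readers and admits at least the same writers.
     */
    boolean flowsTo(SimpleLabel other) {
        return this.readers.containsAll(other.readers)
            && other.writers.containsAll(this.writers);
    }

    /** Join: intersect the readers and union the writers of both labels. */
    SimpleLabel join(SimpleLabel other) {
        Set<String> r = new HashSet<>(readers);
        r.retainAll(other.readers);
        Set<String> w = new HashSet<>(writers);
        w.addAll(other.writers);
        return new SimpleLabel(r, w);
    }
}
```

In this model both operands of a join flow to the result, which matches the requirement that the combined label imposes the restrictions of both inputs.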

The types of variables and expressions in Jif programs 
include labels. For example, a value with type int{o → r; ⊥ ← ⊥}
is an integer with label {o → r; ⊥ ← ⊥}: it
can be read only by principals that can act for r or o, and 
has the lowest possible integrity. A Jif programmer may 
annotate the type declarations of fields, variables, and 
methods with labels; use of fields, variables, and meth- 
ods must comply with the label annotations. For types 
left unannotated, the Jif compiler either chooses default 
labels, or automatically infers labels, thus reducing the 
annotation burden on the programmer. 





Although a Jif programmer may annotate a program 
with arbitrary labels, he does not have complete control 
over security. Labels must be internally consistent for the 
program to type-check, and moreover, the labels must be 
consistent with security policies from the external envi- 
ronment. In SIF, a web application interacts with the 
external environment through the SIF interfaces, as well 
as interfaces for back-end services (e.g., databases). 

Figure 1: Handling a request in SIF.

Jif’s type system prevents labeled information from
being unintentionally downgraded, or assigned a less-restrictive
label. Downgrading confidentiality increases
the set of principals permitted to read the information,
whereas downgrading integrity reduces the set of principals
considered to have influenced the information.
The type system prevents unintentional downgrading by
tracking the data dependencies (information flow) in the
program, including implicit flows [8]: covert storage
channels that arise from program control structure. Jif
does permit information to be intentionally downgraded,
but any code that does so requires the authority of all
principals whose reader or writer policies are weakened
or removed as a result of the downgrading.
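An implicit flow can be seen in a few lines of ordinary Java. The sketch below is only an illustration with hypothetical names: plain Java has no labels, so this compiles and runs, whereas an information-flow type system such as Jif's would be expected to reject the assignment inside the branch if the variable secret carried a more restrictive label than publicCopy.

```java
/** Demonstrates an implicit flow: the secret is never assigned directly to the result. */
final class ImplicitFlowDemo {
    static boolean leakViaControlFlow(boolean secret) {
        boolean publicCopy = false;
        if (secret) {          // the branch condition depends on the secret
            publicCopy = true; // this assignment only runs on the secret-dependent path
        }
        return publicCopy;     // the result now equals the secret
    }
}
```

Although no statement copies secret into publicCopy, the returned value always equals the secret, which is why control structure has to be tracked along with data dependencies.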


Timing and synchronization channels. Jif’s type sys- 
tem does not track information flow via timing or ter- 
mination channels. These covert channels are not a se- 
rious concern if web applications are not implemented 
by adversaries; we assume that application programmers 
are not malicious. Other work (e.g., [1, 31]) has investi- 
gated checking and transforming security-typed code to 
remove timing channels. Termination channels (which 
can be regarded as an extreme timing channel) are low- 
bandwidth, leaking at most one bit per interaction with 
the web application, that is, one bit per request. 

Jif was developed assuming a single-threaded execu- 
tion model. However, SIF web applications are multi- 
threaded Jif programs, and thread synchronization can 
create covert timing channels that transmit information. 
This risk can be mitigated by configuring the web server 
to handle at most one concurrent request per servlet, 
or by isolating concurrent requests or sessions in the 
protection domains offered by some Java run-time sys- 
tems [12, 3]. 


2.3 Design of SIF 


Like the Java Servlet framework, SIF allows application
code to define how client requests are handled. However,
there are some structural differences that facilitate the accurate
tracking of information flow. Figure 1 presents an
overview of how SIF handles a request from a web client:


1. An HTTP request is made from a web client to a 
servlet; 

2. The HTTP request is wrapped in a Request object; 

3. An appropriate Action object of the servlet is 
found to handle the request, and its invoke method 
called with the Request object; 

4. The action’s invoke method generates a Page ob- 
ject to return for the request; 

5. The Page object is converted into HTML, which is 
returned to the client. 
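The five steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not SIF's implementation: every class name (SketchRequest, SketchPage, SketchAction, SketchServlet) and the parameter name "action" are hypothetical, labels are omitted entirely, and the action returns its page directly instead of calling a separate method to set the return page.

```java
import java.util.HashMap;
import java.util.Map;

// All names below are illustrative stand-ins, not the SIF classes themselves.
final class SketchRequest {
    private final Map<String, String> params;

    SketchRequest(Map<String, String> params) {
        this.params = new HashMap<>(params);
    }

    /** The only way application code can reach the request data. */
    String getParam(String name) {
        return params.get(name);
    }
}

final class SketchPage {
    private final String title; // assumed to be a trusted constant, so it is not escaped

    SketchPage(String title) {
        this.title = title;
    }

    String toHtml() {
        return "<html><head><title>" + title + "</title></head><body></body></html>";
    }
}

interface SketchAction {
    SketchPage invoke(SketchRequest req);
}

final class SketchServlet {
    private final Map<String, SketchAction> actions = new HashMap<>();
    private final SketchAction defaultAction;

    SketchServlet(SketchAction defaultAction) {
        this.defaultAction = defaultAction;
    }

    void register(String id, SketchAction action) {
        actions.put(id, action);
    }

    /** Steps 2 to 5: wrap the parameters, find an action, invoke it, render the page. */
    String handle(Map<String, String> rawParams) {
        SketchRequest req = new SketchRequest(rawParams);           // step 2
        String id = req.getParam("action");
        SketchAction action =
            (id == null) ? defaultAction : actions.get(id);         // step 3
        if (action == null) {
            return new SketchPage("Error").toHtml();                // unknown identifier
        }
        SketchPage page = action.invoke(req);                       // step 4
        return page.toHtml();                                       // step 5
    }
}
```

Because the application never sees the raw HTTP request and never writes bytes to the response itself, the framework in this sketch remains the only component that touches the network boundary.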


Step 1: HTTP request from web client to servlet. Web 
applications must extend the class Servlet, which is 
similar to the HttpServlet class of the Java Servlet 
framework. Figure 2 shows a simplified Jif signature for 
the Servlet class, as well as other key classes of SIF. 
The important aspects of these signatures are explained 
as they arise, but because of space limitations, the syntax 
of Jif methods and fields is not fully explained.

Web clients establish sessions with the servlet; ses- 
sions are tracked by the servlet container, as in the Java 
Servlet specification. The SIF framework creates a ses- 
sion principal for each session, which can be thought of 
as corresponding to the session key shared between the 
client and server [16], if such a key exists. The applica- 
tion would typically define its own user principals, which 
can delegate to the session principal. 


Step 2: HTTP request wrapped in a Request object. 
The class Request is a SIF wrapper class for an HTTP 
request, providing restricted access to information in the 
request, via the getParam method. The restricted in- 
terface ensures that web applications are unable to cir- 
cumvent the security policies on data contained in the 
request, as described below. 





USENIX Association 


16th USENIX Security Symposium 


abstract class Servlet {
    // allows servlets to specify
    // a default action
    protected Action{req} defaultAction(Request req);

    // allows servlets to create a
    // servlet-specific SessionState object
    protected SessionState createSessionState();

    public void setReturnPage{*:req.session}(
            Request{*:req.session} req,
            label out, label in,
            Node[out,in]{*in} page)
        where {*out;*in} <= {*:req.session};
}

abstract class Action {
    public abstract void
    invoke{*lbl}(label{*lbl} lbl,
                 Request{*lbl} req)
        where caller(req.session);
}

// base class of HTML elements
abstract class Node[label Out, label In] { }

final class Request {
    // principal representing the session
    // between client and server
    public final principal session;

    // reference to the Servlet
    public final Servlet servlet;

    // acquire a parameter value from the Request
    public String{*inp.L ⊔ (⊥→⊥; ⊤←session)}
        getParam(Input inp);

    // obtain a reference to
    // the SessionState object
    public SessionState getSessionState();
}

final class Input {
    private final Nonce n;
    public final label L;
}

abstract class InputNode[label Out, label In]
        extends Node[Out,In] {
    // framework statically enforces Out ⊔ In ⊑ inp.L
    private final Input{L} inp;
}

Figure 2: Jif signatures for SIF classes 


Step 3: An Action is found and invoked. Web applica- 
tions implement their functionality in actions, which are 
application-defined subclasses of the SIF class Action. 
A SIF servlet may have many action objects associated 
with it; each action object belongs to a single servlet. 

Actions can be used as the targets of forms and hyper- 
links. For example, the target of a form is an action ob- 
ject responsible for receiving and processing the data the 
user submits via the form. This mechanism differs from 
the standard Java servlet interface, which requires the ap- 
plication implementor to write explicit request dispatch- 
ing code (the doGet method). However, explicit dispatch 
code in the application makes precise tracking of infor- 
mation flow difficult, as the dispatch code is executed for 
all requests, even though different requests may reveal 
different information. By avoiding dispatch code, the 
action mechanism permits more precise reasoning about 
the information revealed by client requests to the server, 
as discussed further in Section 2.4. 

Action objects may be session-specific actions, which 
can only ever be used by a single session, or they may 
be external actions not specific to any given session. All 
action objects within a given servlet have a unique iden- 
tifier. For session-specific actions, the identifier is a se- 
cure nonce, automatically generated by the framework 
on construction of the action. For external actions, the 
identifier is a (human-readable) string specified by the 
web application. Since external actions have fixed iden- 
tifiers, they may be the target of external hyperlinks, such 
as a hyperlink in static HTML on a different web site. 

When an HTTP request is received by a servlet, the 
framework finds a suitable action to handle it. Typically, 
the HTTP request contains a parameter value specifying 
the unique identifier of the appropriate action; for example,
forms generated by the servlet identify the action to
which the form is to be submitted. If the HTTP request
does not contain an action’s unique identifier, then a default
action specified by the Servlet.defaultAction
method is used to handle the request. This default is use- 
ful for handling the first request of a new session. If the 
HTTP request contains an invalid action identifier (e.g., 
the identifier of a session-specific action of an expired 
or invalidated session), an error page is returned, which 
then redirects the user to the default action. 

Actions allow web applications to maintain control 
over application control flow. Because session-specific 
actions are named with a nonce, other sessions cannot 
invoke them. In addition, SIF tracks the active set of 
actions for each session. An error page is returned if a 
request tries to invoke an action that is not active. The 
active set contains all external actions, and all session- 
specific actions that were targets of hyperlinks and forms 
of the last response. Thus, a client by default cannot re- 
submit a form by replaying its (inactive) action identifier. 
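The interplay of nonce identifiers and the active set can be sketched as follows. This Java fragment is an illustration under stated assumptions, not SIF's actual bookkeeping: the class ActiveActionSet and its methods are hypothetical, and for brevity it generates an identifier at the moment an action is placed on the page being built, whereas the framework described above generates the nonce when the action object is constructed.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

/** Illustrative per-session bookkeeping for action identifiers. */
final class ActiveActionSet {
    private static final SecureRandom RANDOM = new SecureRandom();

    private final Set<String> externalIds = new HashSet<>();  // fixed, always active
    private Set<String> activeSessionIds = new HashSet<>();   // targets of the last response
    private Set<String> pendingIds = new HashSet<>();         // targets of the response being built

    void addExternal(String id) {
        externalIds.add(id);
    }

    /** Creates an unguessable identifier for a session-specific action on the next page. */
    String newSessionAction() {
        byte[] bytes = new byte[16]; // 128 bits of randomness
        RANDOM.nextBytes(bytes);
        String id = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        pendingIds.add(id);
        return id;
    }

    /** Called once a response has been sent: only that response's targets stay active. */
    void responseSent() {
        activeSessionIds = pendingIds;
        pendingIds = new HashSet<>();
    }

    boolean isActive(String id) {
        return externalIds.contains(id) || activeSessionIds.contains(id);
    }
}
```

After a second response has been sent, an identifier that appeared only in the first response is no longer active, so replaying it would lead to the error page, while external identifiers remain usable throughout.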

Once the appropriate action object has been found, the 
invoke method is called on it with a Request object 
as an argument. The invoke method executes with the 
authority of the session principal, as shown by the where 
caller(req.session) annotation in Figure 2.

Web applications implement their functionality in the 
action’s invoke method, as Jif 3.0 code. If required, 
the invoke method can access back-end services (e.g., 
a database) provided that suitable Jif interfaces exist for 
the services. For example, web applications can access 
the file system since the Jif run-time library provides a Jif 
interface for it, which translates file system permissions 
into Jif security policies. 

SIF web applications can provide secure web inter- 
faces to legacy systems, by accessing the legacy sys- 
tems as back-end services. The information security of 
these systems is not compromised by allowing SIF applications
to access them, since all accesses from Jif code
must conform to the system’s Jif interface. 


Step 4: The invoke method generates a Page object. 
An object of the class Page is a representation of an 
HTML page. SIF uses the class Node to represent HTML 
elements; the class Page, and other HTML elements, 
such as Paragraph and Hyperlink, are subclasses of 
Node. Nodes may be composed to form trees, which 
represent well-formed HTML code. The class Node is 
parameterized by two labels, Out and In. The Out la- 
bel is an upper bound on the labels of information con- 
tained in the node object and its children. For example, 
an HTML body may contain several paragraphs, each of 
which contains text and hyperlinks; the Out parameter 
of each Paragraph node is at least as restrictive as the 
Out parameters of its child Nodes. The In parameter is 
used to bound information that may be gained from
subsequent requests originating from this page, and is 
discussed further in Section 2.4. 

The Action.invoke method must generate a Page 
object, and call Servlet.setReturnPage with that 
Page object as an argument. The signature for 
Servlet.setReturnPage ensures that the Out parameter
of the Page is at most as restrictive as the label
$\{\top \rightarrow \texttt{req.session};\ \bot \leftarrow \bot\}$, where req.session is
the session principal. This label is an upper bound on
all labels that permit the principal req.session to read
information, and thus the Page object returned for the request
can contain only information that the session principal
is permitted to view. This restriction is enforced
statically through the type-system, and requires no run- 
time examination of labels by the SIF framework. Thus, 
assurance is gained prior to deployment that confidential 
information on the server is not inadvertently released. 

In addition, by requiring the application to produce 
Page objects instead of arbitrary byte sequences, SIF can 
ensure that each input field on a page has an appropriate 
security policy associated with it (see Section 2.4), and 
that the web application serves only well-formed HTML 
that does not contain possibly malicious JavaScript. 


Step 5: The Page is converted into HTML. SIF con- 
verts the Page object into HTML, which is sent to the 
client. The Page object may contain hyperlinks and 
forms whose targets are actions of the servlet; SIF en- 
sures that the HTML output for these hyperlinks and 
forms contains parameter values specifying the appropriate
actions’ unique identifiers. If the user follows a hyperlink
or submits a form, the appropriate action is invoked.


2.4 Information flow over requests 


The Jif compiler ensures that security policies are en- 
forced end-to-end within a servlet, that is, from when a 


request is submitted until a response is returned. How- 
ever, information may flow over multiple requests within 
the same session, for example, by being stored in session 
state, or by being sent to a (well-behaved) client that re- 
turns it in the next request. SIF tracks information flow 
over multiple requests, to ensure that appropriate security 
labels are enforced on data at all times. 


Information flow through parameter values. SIF re- 
quires each input field on a page to have an associated 
security label to be enforced on the input when submit- 
ted. This label is statically required to be at least as re- 
strictive as the label of any default value for the input 
field, to prevent a default value from being sent back to 
the server with a less restrictive policy enforced on it. 
SIF ensures that the submitted value of an input field 
has the correct label enforced on it by preventing ap- 
plications from arbitrarily accessing the HTTP request’s 
map from parameter keys to parameter values.  In- 
stead, when an input field is created in the outgoing 
Page object, an Input object is associated with it. An 
Input object is a pair (n,L), where n is a freshly gen- 
erated nonce, and L is the label enforced on the input 
value. An application can retrieve a data value from an 
HTTP request only by presenting the Input object to 
the Request.getParam(Input inp) method, which 
checks the nonce, and returns the submitted value with 
label inp.L enforced on it. This “closes the loop,” en- 
suring that data sent to the client has the correct security 
enforced on it when the client subsequently sends it back. 
SIF does not try to protect against the user copying 
sensitive information from the web page and pasting it into
a non-sensitive input field. That is impossible in general, 
and the application should define labels that prevent the 
user from seeing information that they are not trusted to 
see. By keeping track of input labels, SIF prevents web 
applications from laundering away security policies by 
sending information through the client. As discussed in 
Section 2.5, the user can also inspect the labels on inputs 
to see how the application will treat the information. 
The getParam method signature also ensures that the
label $\{\bot \rightarrow \bot;\ \top \leftarrow \texttt{session}\}$ is enforced on values
submitted by the user. This label indicates that
the value has been influenced by the session princi- 
pal. Thus, SIF ensures that the integrity policy of any 
value obtained from the client correctly reflects that the 
client has influenced it; the Jif 3.0 compiler then ensures 
that this “tainted,” or low-integrity, information cannot 
be incorrectly used as if it were “untainted,” or high- 
integrity. This helps avoid vulnerabilities such as SQL 
injection, where low-integrity information is used in a 
high-integrity context. 
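
The nonce-and-label pairing can be illustrated with a small Java sketch. It is only an approximation under stated assumptions: FieldInput, LabeledValue, and FormRequest are hypothetical stand-ins for SIF's Input and Request classes, and the label is carried as a plain string at run time, whereas in SIF the Jif compiler enforces it statically.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for SIF's Input: a fresh nonce paired with a label. */
final class FieldInput {
    private static final SecureRandom RANDOM = new SecureRandom();

    final String nonce;  // freshly generated; names the form field in the generated HTML
    final String label;  // label to enforce on whatever the client submits for this field

    FieldInput(String label) {
        this.nonce = Long.toHexString(RANDOM.nextLong()) + Long.toHexString(RANDOM.nextLong());
        this.label = label;
    }
}

/** A submitted value together with the label it must carry. */
final class LabeledValue {
    final String value;
    final String label;

    LabeledValue(String value, String label) {
        this.value = value;
        this.label = label;
    }
}

/** Hypothetical request wrapper that hides the raw parameter map from the application. */
final class FormRequest {
    private final Map<String, String> rawParams = new HashMap<>();

    /** Simulates the client submitting the field that was generated for the given nonce. */
    void submit(String fieldNonce, String value) {
        rawParams.put(fieldNonce, value);
    }

    /** The only way to read a parameter: present the input object created with the field. */
    Optional<LabeledValue> getParam(FieldInput inp) {
        String raw = rawParams.get(inp.nonce);
        if (raw == null) {
            return Optional.empty();
        }
        return Optional.of(new LabeledValue(raw, inp.label));
    }
}
```

Because the parameter map is private and keyed by nonce, application code that does not hold the matching input object has no way to read the value under a weaker label.
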


Information flow through session state. Java servlets
typically store session state in the session map of
the class javax.servlet.http.HttpSession. However,
direct access to the session map would allow
SIF applications to bypass the security policies that
should be enforced on values stored in the map. Instead,
SIF web applications may store state in fields
of session-specific actions, or in an application-defined
subclass of SessionState. Since fields must have
labels, the Jif compiler ensures that web applications
honor labels associated with values stored in
the state. Web applications may override the method
Servlet.createSessionState to create an appropriate
SessionState object; SIF ensures at run time that
this method is called exactly once per session.


Information flow through action invocation. A sub- 
tlety of the framework is that the very act of invoking an 
action, by following a hyperlink or submitting a form, 
may reveal information to the web application. For ex- 
ample, if a hyperlink to some action a is generated if and 
only if some secret bit is 1, then knowing that a is in- 
voked reveals the value of the secret bit. 

To account for this information flow, the
Action.invoke method takes two arguments: a
label lbl, and a reference to the Request object. The
label lbl is an upper bound on the information that may
be gained by knowing which action has been invoked.
This means that lbl must be at least as restrictive as
the output information for the hyperlink or form used
to invoke the action. In our example, the value of lbl
when invoking a would be at least as restrictive as the
label of the secret bit. In general, the value for lbl is
the value of the In parameter of the Node that contains
the link to the action; the constructors for the Node
subclasses ensure that the parameter In correctly bounds
the information that may be gained by knowing the node
was present in the Page returned for the request.

The method signature for Action.invoke ensures
that the security label lbl is enforced on the reference
to the Request object (“...Request{*lbl} req...”) and
that lbl is a lower bound for observable side effects of
the method (“invoke{*lbl}(...)”), meaning that any
effects of the method (such as assignments to fields) must
be observable only at security levels bounded below by
lbl. These restrictions ensure that SIF correctly tracks
the information that may be gained by knowing which
actions were available for the user to invoke.


2.5 Deploying SIF web applications 


SIF web applications may be deployed on standard Java 
Servlet containers, such as Apache Tomcat, and thus may 
be used in a multi-tier architecture wherever Java servlets 
are used. The SIF and Jif run-time libraries must be 
available on the class path, but deployment of SIF web 
applications is otherwise similar to deployment of ordi- 
nary Java servlets. The deployer of a SIF web application 


is free to specify configuration information in the appli- 
cation’s deployment descriptor (the web.xml file). For 
example, the deployer may require all connections to use 
SSL, thus protecting the confidentiality and integrity of 
information in transit between client and server. Addi- 
tionally, there are several SIF-specific options that a de- 
ployer may specify in the deployment descriptor. 


Cascading style sheets. SIF applications must use the 
Node subclasses to generate responses to requests, which 
allows them to generate only well-formed HTML. To al- 
low flexibility in presentation details such as colors and 
font attributes, SIF permits the deployment descriptor 
to specify a cascading style sheet (CSS) to use in the 
presentation of all HTML pages generated by the ap- 
plication; SIF adds this URL in the head of all gener- 
ated HTML pages. Node objects can specify a class 
attribute, allowing style sheets to provide almost arbi- 
trary formatting. While this allows great flexibility, care 
must be taken that the CSS does not contain mislead- 
ing formatting. For example, inappropriate formatting 
might lead a user to enter sensitive information into a 
non-sensitive input field, such as a social security num- 
ber into an address field. The deployer should review the 
CSS before deploying the application. 


JavaScript. Dynamically generated JavaScript can pro- 
vide rich user interfaces, but introduces new possibili- 
ties for security violations and covert channels. SIF does 
not allow web applications to send dynamic JavaScript 
to the client. However, as with CSSs, SIF allows deploy- 
ment descriptors to specify a URL containing (static) 
JavaScript code to be included on all generated HTML 
pages. Explicit inclusion of JavaScript permits easy re- 
view by the deployer. Ideally, SIF should automatically 
check included JavaScript code (or perhaps an extension 
of JavaScript with information-flow control); we leave 
this to future work. 


Policy visualization. User awareness of security poli- 
cies is an important aspect of secure systems. Since SIF 
tracks the policies of information sent to the user, SIF 
can augment the user interface to inform the user of the 
security policies of data they view and supply. Provided 
the user trusts the interface (see Section 2.1), this helps 
prevent, for instance, a user from inappropriately copy- 
ing sensitive information from the browser into an email, 
or from following an untrusted hyperlink. 

Web applications may opt to allow SIF to automati- 
cally color-code information sent to the client, based on 
policy annotations. When the user presses a hotkey com- 
bination, JavaScript code recolors the page elements to 
reflect their confidentiality, varying from red (highly con- 
fidential) to green (low confidentiality). Both displayed 
information and inputs are colored appropriately. An additional
hotkey colors the page based on the integrity
policies of information. A third hotkey shows a legend of
colors and corresponding labels so the user can identify 
the precise security policy for each page element. 


3 Language extensions 


Web applications have diverse, complicated, and dy- 
namic security requirements. For example, web appli- 
cations display a plethora of authentication schemes, in- 
cluding various password schemes, password recovery 
schemes, biometrics, and CAPTCHAs to identify human 
users. Web applications often enforce dynamic security 
policies, such as allowing users to specify who may view 
and update their information. Moreover, the security en- 
vironment of a web application is dynamic: new users 
are being created, users are starting and ending sessions, 
and authenticating themselves. 

In order both to accommodate diverse, complicated, 
and dynamic security requirements, and to provide as- 
surance that these requirements are met, we have pro- 
duced Jif 3.0, a new version of Jif. Section 2.2 describes 
the previous version of Jif; this section presents new fea- 
tures that support dynamic security requirements: inte- 
gration of information flow with application-defined au- 
thentication and authorization, and improved ability to 
reason about and compute with dynamic security labels 
and principals. 

Care was needed in the design and implementation of 
these language extensions, since there is always a tension 
in language-based security between expressiveness and 
security. In particular, the new dynamic security mech- 
anisms in Jif 3.0 create new information channels, com- 
plicating static analysis of information flow. Importantly, 
Jif 3.0 tracks these channels to prevent their misuse. 


3.1 Application-specific principals


Principals are entities with security concerns. Applications
may choose which entities to model as principals.
Principals in Jif are represented at run time, and thus can 
be used as values by programs during execution. Jif gives 
run-time principals the primitive type principal. Jif 
3.0 introduces an open-ended mechanism that allows ap- 
plications great flexibility in defining and implementing 
their own principals. 

Applications may implement the Jif 3.0 interface 
jif.lang.Principal, shown in simplified form in 
Figure 3. Any object that implements the Principal in- 
terface is a principal; it can be cast to the primitive type 
principal, and used just as any other principal. The 
Principal interface provides methods for principals to 
delegate their authority and to define authentication. 

Delegation is crucial. For example, user principals 
must be able to delegate their authority to session princi- 
pals, so that requests from users can be executed with 


interface Principal {
    String name();

    // does this principal delegate authority to q?
    boolean delegatesTo(principal q);

    // is this principal prepared to authorize the
    // closure c, given proof object authPrf?
    boolean isAuthorized(Object authPrf,
                         Closure[this] c);

    // methods to guide search for acts-for proofs
    ActsForProof findProofUpTo(Principal p);
    ActsForProof findProofDownTo(Principal q);
}

interface Closure[principal P] authority(P) {
    // authority of P is required to
    // invoke a Closure
    Object invoke() where caller(P);
}



Figure 3: Signatures for application-specific principals 


their authority. The method call p.delegatesTo(q) 
returns true if and only if principal p delegates its au- 
thority to principal q. The implementation of a prin- 
cipal’s delegatesTo method is the sole determiner of 
whether its authority is delegated. An acts-for proof 
is a sequence of principals $p_1, \ldots, p_n$ such that each
$p_i$ delegates its authority to $p_{i+1}$, and is thus a proof
that $p_n$ can act for $p_1$. Acts-for proofs are found using
the methods findProofUpTo and findProofDownTo 
on the Principal interface, allowing an application to 
efficiently guide a proof search. Once an acts-for proof 
is found, it is verified using delegatesTo, cleanly sepa- 
rating proof search from proof verification. 
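
The verification half of this separation is simple enough to sketch in Java. The interface SimplePrincipal and the classes NamedPrincipal and ActsForVerifier below are hypothetical simplifications of jif.lang.Principal and the Jif run-time system, assumed here for illustration; the proof search itself is not modeled.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified principal: only the delegation query is modeled. */
interface SimplePrincipal {
    String name();
    boolean delegatesTo(SimplePrincipal q);
}

/** A principal that keeps an explicit set of the names it delegates to. */
final class NamedPrincipal implements SimplePrincipal {
    private final String name;
    private final Set<String> delegates = new HashSet<>();

    NamedPrincipal(String name) {
        this.name = name;
    }

    void addDelegate(SimplePrincipal q) {
        delegates.add(q.name());
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public boolean delegatesTo(SimplePrincipal q) {
        return delegates.contains(q.name());
    }
}

final class ActsForVerifier {
    /**
     * Checks a candidate proof p1, ..., pn: every principal in the chain must
     * delegate to its successor. A valid chain shows that pn can act for p1.
     * A one-element chain is accepted because acts-for is reflexive.
     */
    static boolean verify(List<? extends SimplePrincipal> chain) {
        if (chain.isEmpty()) {
            return false;
        }
        for (int i = 0; i + 1 < chain.size(); i++) {
            if (!chain.get(i).delegatesTo(chain.get(i + 1))) {
                return false;
            }
        }
        return true;
    }
}
```

Since the verifier consults only delegatesTo, a buggy or adversarial search procedure can at worst fail to find a proof; it cannot make an invalid chain pass.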

The authority of principals is required for certain oper- 
ations. For example, the authority of the principal Alice 
is required to downgrade information labeled
$\{\mathit{Alice} \rightarrow \mathit{Bob};\ \top \leftarrow \top\}$ to the label
$\{\mathit{Alice} \rightarrow \mathit{Bob}, \mathit{Chuck};\ \top \leftarrow \top\}$,
since a policy owned by Alice is weakened. The authority
of principals whose identity is known at compile
time may be obtained by these principals approving the 
code that exercises their authority. However, for dynamic 
principals, whose identity is not known at compile time, 
a different mechanism is required. We have extended Jif 
with a mechanism for dynamically authorizing closures. 

An authorization closure is an implementation of the 
interface jif.lang.Closure, shown in Figure 3. The
Closure interface has a single method invoke, and 
is parameterized on a principal P. The invoke method 
can only be called by code that possesses the author- 
ity of principal P, as indicated by the annotation where 
caller(P). Code that does not have the authority of
principal P can request the Jif run-time system to exe- 
cute a closure for P; the run-time system will do so only 
if P authorizes the closure. 

The Principal interface provides a method for authorizing
closures, isAuthorized. It takes two arguments:
a Closure object instantiated with the principal
represented by the this object, and an application-specific
proof of authentication and/or authorization.
For example, the proof might be a password, a checkable
proof that the closure satisfies certain safety requirements,
or a collection of certificates or capabilities.
The application-specific implementation of the
isAuthorized method examines the closure and the
proof object, and returns true if the principal grants its
authority to the closure.
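
As a rough Java analogue of this mechanism, the sketch below uses a password as the proof object. All names (AuthClosure, PasswordPrincipal, ClosureRunner) are hypothetical; Jif's principal-parameterized Closure[P] type and its static authority checks have no direct Java equivalent, so the check here happens only at run time.

```java
import java.util.Optional;

/** Hypothetical analogue of Jif's Closure[P]: code that needs P's authority to run. */
interface AuthClosure<T> {
    T invoke();
}

/** Hypothetical user principal that authorizes closures on presentation of its password. */
final class PasswordPrincipal {
    private final String name;
    private final String password;  // a real system would keep a salted hash instead

    PasswordPrincipal(String name, String password) {
        this.name = name;
        this.password = password;
    }

    String name() {
        return name;
    }

    /** Grants authority to the closure only if the proof object is the right password. */
    boolean isAuthorized(Object authPrf, AuthClosure<?> c) {
        return c != null && password.equals(authPrf);
    }
}

/** Stand-in for the run-time system: runs a closure for p only if p authorizes it. */
final class ClosureRunner {
    static <T> Optional<T> execute(PasswordPrincipal p, Object authPrf, AuthClosure<T> c) {
        if (!p.isAuthorized(authPrf, c)) {
            return Optional.empty();  // authority not granted; the closure never runs
        }
        return Optional.ofNullable(c.invoke());
    }
}
```

For example, `ClosureRunner.execute(alice, suppliedPassword, () -> doLogin())` would run the hypothetical `doLogin` only when the supplied password matches; the decision stays entirely inside the principal's own isAuthorized method.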

The Principal interface and authorization closures 
provide a flexible mechanism for web applications to 
implement their own authentication and authorization 
mechanisms. For example, in the case studies of Sec- 
tion 4, closures are used to obtain the authority of ap- 
plication users after they have authenticated themselves 
with a password. Other implementations of principals 
are free to choose other authentication and authorization 
mechanisms, such as delegating the authorization decision
to an XACML service. Dynamic authorization tests
introduce new information flows that are tracked using 
Jif’s security-type system. To prevent the usurpation of a 
principal’s authority, the Jif run-time library cannot exe- 
cute a closure unless appropriately authorized. 

Legacy systems may have their own abstractions for 
users, authentication, and authorization. Application- 
specific principals allow legacy-system security abstrac- 
tions to be integrated with web applications. For exam- 
ple, when integrating with a database with access con- 
trols, database users can be represented by suitable im- 
plementations of the Principal interface; web appli- 
cations can then execute queries under the authority of 
specific database users, rather than executing all queries 
using a distinguished web server user. 


3.2 Dynamic labels and principals 


Jif can represent labels at run time, using the primitive 
type label for run-time label values. Following work 
by Zheng and Myers [42], Jif 3.0’s type system has been 
extended with more precise reasoning about run-time la- 
bels and principals. It is now possible for the label of a 
value (or a principal named in a label) to be located via a 
final access path expression. A final access path expression
is an expression of the form $r.f_1.\ldots.f_n$, where $r$ is
either a final local variable (including final method arguments),
or the expression this, and each $f_i$ is an access
to a final field. For example, in Figure 2, the signature
for the method Request.getParam(Input inp) indicates
that the return value has the label inp.L enforced
on it. Therefore, the Jif 3.0 compiler can determine that 
the label of the result of the getParam method is found 
in the object inp. The additional precision of Jif 3.0 is 
needed to capture this relationship. 

This additional precision allows SIF web applications 
to express and enforce dynamic security requirements, 
such as user-specified security policies. SIF web appli- 
cations can also statically control information received 


from the currently authenticated user, whose identity is 
unknown at compile time. 

The use of dynamic labels and principals introduces 
new information flows, because which label is enforced 
on information may itself reveal information. Jif 3.0’s 
type system tracks such flows, and prevents dynamic la- 
bels and principals from introducing covert channels. 


3.3 Caching dynamic tests 


To allow efficient dynamic tests of label and principal 
relations, the Jif 3.0 runtime system caches the results 
of label and principal tests. Separate caches are main- 
tained for positive and negative results of acts-for and 
label tests. Care must be taken that the use of caches 
does not introduce unsoundness. When a principal del- 
egation is added, the negative acts-for and label caches 
are cleared, as the new delegation may now enable new 
relationships. When a principal delegation is removed, 
entries in the positive acts-for and label caches that de- 
pend upon that delegation are removed, as the relation- 
ship may no longer hold. 

When principals add or remove delegations, they 
should notify the Jif 3.0 runtime system, which updates 
the caches appropriately. Although an incorrectly or ma- 
liciously implemented principal p may fail to notify the 
runtime system, lack of notification can hurt only the 
principal p, since p (and only p) determines to whom its 
authority is delegated. 
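
The invalidation rules can be sketched as follows. This is an illustrative Java model covering only the acts-for cache, not the Jif 3.0 runtime: the class ActsForCache, its string-encoded queries, and the per-entry dependency sets are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical cache of acts-for test results with separate positive and negative parts. */
final class ActsForCache {
    // Positive results remember which delegations their proof depended on.
    private final Map<String, Set<String>> positive = new HashMap<>();
    private final Set<String> negative = new HashSet<>();

    static String key(String actor, String target) {
        return actor + " actsfor " + target;
    }

    static String delegation(String from, String to) {
        return from + " -> " + to;
    }

    void recordPositive(String query, Set<String> usedDelegations) {
        negative.remove(query);
        positive.put(query, new HashSet<>(usedDelegations));
    }

    void recordNegative(String query) {
        positive.remove(query);
        negative.add(query);
    }

    /** Returns the cached answer, or empty if the test must be recomputed. */
    Optional<Boolean> lookup(String query) {
        if (positive.containsKey(query)) {
            return Optional.of(true);
        }
        if (negative.contains(query)) {
            return Optional.of(false);
        }
        return Optional.empty();
    }

    /** A new delegation may enable relationships that previously failed. */
    void delegationAdded() {
        negative.clear();
    }

    /** Drop only the positive entries whose proof used the removed delegation. */
    void delegationRemoved(String removed) {
        positive.values().removeIf(deps -> deps.contains(removed));
    }
}
```

The asymmetry is deliberate: adding a delegation can only turn negative answers positive, so only the negative cache is cleared, while removing one can only invalidate proofs that relied on it.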


4 Case studies 


Using SIF, we have designed and implemented two web 
applications. The first is a cross-domain information 
sharing system that permits multiple users to exchange 
messages. The second is a multi-user calendar applica- 
tion that lets users create, edit, and view events. 

This section describes the key functionality of these 
applications, their information security requirements, 
and how we reflected these requirements in the imple- 
mentations. Real applications must release information, 
reducing its confidentiality. In SIF, this is implemented 
by downgrading to a lower security label. We discuss 
and categorize downgrades that occur in the applications. 
Based on our experience, we make some observations 
about programming with information-flow control. 


4.1 Application descriptions 


Cross-domain information sharing (CDIS). CDIS ap- 
plications involve exchange of information between dif- 
ferent entities with varying levels of trust between them. 
For example, organizational policy may require the ap- 
proval of a manager to share information between mem- 
bers of certain departments. Many CDIS systems pro- 
vide an automatic process; for example, they determine 







what approval is needed, and delay information delivery 
until approval is obtained. 

We have designed and implemented a prototype CDIS 
system. The interface is similar to a web-based email 
application. The application allows users to log in and 
compose messages to each other. A message may require 
review and approval by other users before it is available 
to its recipients. The review process is driven by a set of 
system-wide mandatory rules: each rule specifies for a 
unique sender-recipient pair which users need to review 
and approve messages. Once all appropriate reviewers 
have approved a message, it appears in the recipient’s in- 
box. Each user also has a “review inbox,” for messages 
requiring their approval or rejection. In this prototype, all 
messages are held centrally on the web server; a full im- 
plementation would be integrated with an SMTP server. 


Calendar. We have also implemented a multi-user cal- 
endar system. Authenticated users may create, edit, and 
view events. Events have a time, title, list of attendees, 
and description. Events are controlled by expressive se- 
curity policies, customizable by application users. A user 
can edit an event only if the user acts for the creator of 
the event (recall that the acts-for relation is reflexive). A 
user may view the details of an event (title, attendees, and 
description) if the user acts for either the creator or an attendee.
An event may specify a list of additional users
who are permitted to view the time of the event: to view
the time of an event, a user must act for the creator, for an attendee,
or for a user on this list.

A user’s calendar is defined to be the set of all events 
for which the user is either the creator or an attendee. 
When a user u views another user v’s calendar, u will 
see only the subset of events on v’s calendar for which 
u is permitted to see the details or time. If the user is
permitted to view the time, but not the details of an event, 
the event is shown as “Busy.” 
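
The viewing and editing rules above amount to a small decision procedure, sketched here in Java. In the actual application these rules are enforced through labels and run-time label tests rather than explicit checks, so this is only a model of the resulting behavior; the names Visibility, CalendarEvent, and CalendarPolicy are hypothetical, and the acts-for relation is passed in as a predicate.

```java
import java.util.Set;
import java.util.function.BiPredicate;

enum Visibility { HIDDEN, BUSY, DETAILS }

/** Hypothetical event record: principals are modeled by name only. */
final class CalendarEvent {
    final String creator;
    final Set<String> attendees;
    final Set<String> timeViewers;  // additional users allowed to see only the time

    CalendarEvent(String creator, Set<String> attendees, Set<String> timeViewers) {
        this.creator = creator;
        this.attendees = attendees;
        this.timeViewers = timeViewers;
    }
}

final class CalendarPolicy {
    /** actsFor.test(u, p) is assumed to be true exactly when u acts for p. */
    static Visibility visibilityFor(String viewer, CalendarEvent e,
                                    BiPredicate<String, String> actsFor) {
        boolean details = actsFor.test(viewer, e.creator)
                || e.attendees.stream().anyMatch(a -> actsFor.test(viewer, a));
        if (details) {
            return Visibility.DETAILS;
        }
        boolean timeOnly = e.timeViewers.stream().anyMatch(t -> actsFor.test(viewer, t));
        return timeOnly ? Visibility.BUSY : Visibility.HIDDEN;
    }

    /** Only a user who acts for the creator may edit the event. */
    static boolean canEdit(String user, CalendarEvent e, BiPredicate<String, String> actsFor) {
        return actsFor.test(user, e.creator);
    }
}
```

With a purely reflexive acts-for relation, the creator and attendees see full details, users on the extra list see the event as “Busy,” and everyone else sees nothing.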


Measurements. Measurements of the applications’ code 
are given in Figure 4, including non-blank non-comment 
lines of code, lines with label annotations, and the num- 
ber of declassify and endorse annotations, which in- 
dicate intentional downgrading of information (see Sec- 
tion 4.3). 

Performance tests indicate that the overhead due to 
the SIF framework is modest. We compared the calen- 
dar case study application to a Java servlet we imple- 
mented with similar functionality, using the same back- 
end database; the Java servlet does not offer the security 
assurances of the SIF servlet. Tests were performed us- 
ing Apache Tomcat 5.5 in Redhat Linux, kernel version 
2.6.17, running on a dual-core 2.2GHz Opteron proces- 
sor with 3GB of memory. As the number of concur- 
rent sessions varies between 1 and 245, the SIF servlet 
exhibits at most a 29% reduction in requests processed 
per second, showing that SIF does not dramatically af- 


fect scalability. At peak throughput, the Java servlet pro- 
cesses 2010 requests per second, compared with 1503 
for the SIF servlet. Of the server processing time for a 
request to the SIF servlet, about 17% is spent rendering 
the Page object into HTML, and about 9% is spent per- 
forming dynamic label and principal tests. 
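
For reference, the relative slowdown at peak throughput follows directly from the two request rates quoted above; the worked figure below is derived from those numbers and is not a separately reported measurement.

```latex
\[
\frac{2010 - 1503}{2010} \;=\; \frac{507}{2010} \;\approx\; 0.252
\]
```

That is, roughly a 25% reduction at peak, which is consistent with the stated bound of at most 29% over the whole range of concurrent sessions.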


4.2 Implementing security requirements 


Many of the security requirements of both applications 
can be expressed using Jif’s security mechanisms, in- 
cluding dynamic principals and security labels, and thus 
automatically enforced by Jif and SIF’s static and run- 
time mechanisms. Other security requirements are en- 
forced programmatically. 


Principals. Users of the applications are application- 
specific principals (see Section 3.1). We factored out 
much functionality from both applications relating to 
user management, such as selecting users and logging 
on and off. The sharing of code across both case studies 
shows that SIF permits the design and implementation of 
reusable components. Figure 4 also shows measurements 
of the reusable user library. 

The login process works as follows: a user and pass- 
word are specified on the login screen, and if the pass- 
word is correct, the authority of the user is dynamically 
obtained via a closure; the closure is used to delegate the 
user’s authority to the session principal, who can then act 
on behalf of the now logged-in user. 

In addition to user principals, the two applications de- 
fine principals CDISApp and CalApp, representing the 
applications themselves. These model the security of 
sensitive information that is not owned by any one user, 
such as the set of application users. This information is 
labeled $\{p \rightarrow \top;\ p \leftarrow \top\}$, where $p$ is one of CDISApp or
CalApp, and relevant portions are downgraded for use as 
needed. In particular, information in the database has this 
label. Since all information sent to and from the database 
(including data used in SQL queries) must have this la- 
bel, the authority of the application principal (CDISApp 
or CalApp) is required to endorse information sent to the 
database and to declassify information received from it. 
This provides a form of access control, ensuring that only 
code authorized by the application principal is able to ac- 
cess the database. The need to explicitly endorse data 
used in SQL queries also helps to prevent SQL injec- 
tion attacks, by making the programmer aware of exactly 
what information may be used in SQL queries. 


                     Annotated  Downgrade    Functional downgrades
           Lines     Lines      Annotations  Access control  Imprecision  Application  Total
CDIS       1325      277        76           11              0            3            14
Calendar   1779      443        73           12              0            5            17
User        925      283        31            3              1            4             8

Figure 4: Summary of case studies.


Dynamic security labels. The security labels of Jif
3.0 are expressive enough to capture the case studies’
information-sharing requirements. In particular, we are
able to model the confidentiality and review requirements
for CDIS messages by enforcing appropriate labels
on the messages. For instance, suppose sender $s$
is sending a message to recipient $t$. The confidentiality
policy $s \rightarrow t$ would allow both $s$ and $t$ to read the message.
However, before $t$ is permitted to read the message,
it may need to be reviewed. Suppose reviewers
$r_1, r_2, \ldots, r_n$ must review all messages sent from $s$ to $t$.
When $s$ composes the message, it initially has the following
confidentiality policy:
$(s \rightarrow t, r_1, \ldots, r_n) \sqcup (r_1 \rightarrow r_1, \ldots, r_n) \sqcup \cdots \sqcup (r_n \rightarrow r_1, \ldots, r_n)$.
In this policy, $s$ permits $t$ and all reviewers to read the message, and
each reviewer permits all other reviewers to read the message.
This label allows the message to be read by each reviewer,
but prevents $t$ from reading it. As each reviewer
reviews and approves the message, their authority is used
to remove their reader policy from the confidentiality
policy using declassify annotations. Eventually the
message is declassified to the policy $s \rightarrow t, r_1, \ldots, r_n$,
which permits $t$ to read it.
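
The review workflow can be modeled with plain sets, as in the Java sketch below. It treats a confidentiality label as a collection of owner-to-readers policies in which a principal may read only if every policy permits it, and it models a reviewer's declassification as removal of that reviewer's policy. The class ReviewLabel is hypothetical, labels in Jif are not represented this way, and reviewers are assumed to be distinct from the sender.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a CDIS message label as a join of owner -> readers policies. */
final class ReviewLabel {
    private final String sender;
    private final Map<String, Set<String>> policies = new LinkedHashMap<>();

    private ReviewLabel(String sender) {
        this.sender = sender;
    }

    /** Builds the initial label: one policy owned by the sender and one per reviewer. */
    static ReviewLabel forMessage(String sender, String recipient, List<String> reviewers) {
        ReviewLabel label = new ReviewLabel(sender);
        Set<String> senderReaders = new LinkedHashSet<>(reviewers);
        senderReaders.add(recipient);
        label.policies.put(sender, senderReaders);
        for (String r : reviewers) {
            label.policies.put(r, new LinkedHashSet<>(reviewers));
        }
        return label;
    }

    /** Models a reviewer using their authority to remove their own policy. */
    void approve(String reviewer) {
        if (!reviewer.equals(sender)) {
            policies.remove(reviewer);
        }
    }

    /** A principal may read only if every remaining policy names it as owner or reader. */
    boolean canRead(String principal) {
        for (Map.Entry<String, Set<String>> policy : policies.entrySet()) {
            boolean isOwner = policy.getKey().equals(principal);
            if (!isOwner && !policy.getValue().contains(principal)) {
                return false;
            }
        }
        return true;
    }
}
```

In this model the recipient becomes a permitted reader only after the last reviewer has approved, because each remaining reviewer policy still excludes the recipient.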

The calendar application also enforces user-defined 
security requirements by labeling information with appropriate
dynamic labels. Event details have the confidentiality
policy $c \rightarrow a_1, \ldots, a_n$ enforced on them, where
$c$ is the creator of the event and $a_1, \ldots, a_n$ are the event
attendees. The time of an event has confidentiality policy
$c \rightarrow a_1, \ldots, a_n \sqcap c \rightarrow t_1, \ldots, t_m$, where $t_1, \ldots, t_m$
are the users explicitly given permission by $c$ to view the
event time. Event labels ensure that times and details 
flow only to users permitted to see them; run-time label 
tests are used to determine which events a user can see. 




4.3 Downgrading 


Jif prevents the unintentional downgrading of informa- 
tion. However, most applications that handle sensitive 
information, including the case study applications, need 
to downgrade information as part of their functionality. 
Jif provides a mechanism for deliberate downgrading of 
information: selective declassification [22, 26] is a form 
of access control, requiring the authorization of the own- 
ers of all policies weakened or removed by a downgrade. 
Authorization can be acquired statically if the owner of a 
policy is known at compile time; or authorization can be 
acquired at run time through a closure (see Section 3). 
Jif 3.0 programs must also satisfy typing rules to en- 
force robust declassification [40, 23, 5]. In the context 
of Jif, robustness ensures that no principal p (including 
attackers) is able to influence either what information is 
released to p (a laundering attack), or whether to release 
information to p. For a web application, robustness im- 


plies that users are unable to cause the incorrect release 
of information. Selective declassification and robust de- 
classification are orthogonal, providing different guaran- 
tees regarding the downgrading of information. 

In Jif programs, downgrading is marked by explicit 
program annotations. A declassify annotation allows 
confidentiality to be downgraded, whereas an endorse 
annotation downgrades integrity. 


Downgrading annotations are typically clustered to- 
gether in code, with several annotations needed to ac- 
complish a single “functional downgrade.” For example, 
declassifying a data structure requires declassification of 
each field of the structure [2]. The two applications had 
a combined total of 39 functional downgrades, with an 
average of 4.6 annotations per functional downgrade. 


Figure 4 shows a more detailed breakdown of the use 
of downgrading in each case study. (Details of each 
downgrade appear in Appendix A.) We found that down- 
grading could be divided into three broad categories: ac- 
cess control, imprecision, and application requirements. 


The first category is downgrades associated with dis- 
cretionary access control. Discretionary access control 
is used as a mechanism to mediate information release 
between different application components; any informa- 
tion release requires explicit downgrading. For exam- 
ple, in the calendar application, the set of all events has 
the label {CalApp → ⊤; CalApp ← ⊤}; thus, downgrading
is required both to extract events to display to
the user, and to update events edited by the user; the 
authority of CalApp is required for these downgrades, 
and thus the downgrades serve as a form of discretionary 
access control to the event set. The choice of the label 
{CalApp → ⊤; CalApp ← ⊤} for the event set necessitates
these downgrades; using other labels may result
in fewer downgrades, but without the benefits of this dis- 
cretionary access control. 


Imprecision is another reason for downgrading: some- 
times the programmer can reason more precisely than 
the compiler about security labels and information flows. 
For example, suppose a method is always called with a 
non-null argument: Jif 3.0 has no ability to express this 
precondition, and conservatively assumes that accessing 
the argument may result in a NullPointerException.
Since the exception may reveal information, a spurious 
information flow is introduced, which may require ex- 
plicit downgrading later. Few downgrades fall into this 
category, giving confidence that Jif 3.0 is sufficiently expressive.
Some imprecision could be removed entirely
by extending the compiler to accept and reason about ad- 
ditional annotations, as in JML [17]. 

Security requirements of the application provide the 
third category of downgrade reasons. These downgrades 
are inherent in the application, and cannot and should 
not be avoided. For example, in the calendar application, 
when users are added to the list of event attendees, more 
users are able to see the details of the event, an informa- 
tion release that requires explicit downgrading. 


4.4 Programming with information flow 


During the case studies’ development, we obtained sev- 
eral insights into the design and implementation of appli- 
cations with information flow control. 


Abstractions and information flow. Information flow 
analysis tends to reveal details of computations occurring 
behind encapsulation boundaries, making it important to 
design abstractions carefully. Unless sufficient care is 
taken during design, abstractions will need to be modi- 
fied during implementation. For example, we sometimes 
needed to change a method’s signature several times, 
both while implementing the method body (and discover- 
ing flows we hadn’t considered during design), and while 
calling the method in various contexts (as method invo- 
cation may reveal information to the callee, which we 
hadn’t considered when designing the signature). 


Coding idioms. We found that certain coding idioms 
simplified reasoning about information flow, by putting 
code in a form that either allowed the programmer to bet- 
ter understand it, or allowed Jif’s type system to reason 
more precisely about it. As a simple example, consider 
the following (almost) equivalent code-snippets for as- 
signing the result of method call o.m() to x, followed by 
an assignment to y: 

1. x = o.m(); y = 42;

2. if (o != null) { x = o.m(); } y = 42;

The first snippet throws a NullPointerException if
o is null, and thus information about the value of o flows 
to x, and also to y (since the assignment to y is executed 
only in the absence of an exception). The information 
flow to y is subtle, and a common trap for new Jif pro- 
grammers. In the second snippet, no exception can be 
thrown (the compiler detects this with a data-flow analy- 
sis), and so information about o does not flow to y. This 
snippet avoids the subtle implicit flow to y. More gener- 
ally, making implicit information flow explicit simplifies 
reasoning about information flow. 
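
The difference between the two snippets can be sketched with a toy label tracker in Python. This is only an illustration of the reasoning above, not how the Jif compiler works: labels are modeled as sets of sources whose values may be revealed, and the function and parameter names are assumptions of this sketch.

```python
# Toy label tracking: a label is a frozenset of sources that may be revealed.
# "pc" stands for the program-counter label, i.e. what is learned merely from
# knowing that execution reached a given point.

def snippet1(label_o, label_result):
    """Labels of x and y for:  x = o.m(); y = 42;"""
    pc = frozenset()
    # The call may throw NullPointerException depending on whether o is null.
    x = pc | label_o | label_result
    # Code after the call runs only if no exception occurred, so pc now
    # carries information about o.
    pc = pc | label_o
    y = pc                      # y = 42 is assigned under the tainted pc
    return x, y


def snippet2(label_o, label_result):
    """Labels of x and y for:  if (o != null) { x = o.m(); } y = 42;"""
    pc = frozenset()
    inner_pc = pc | label_o     # the branch taken depends on o
    x = inner_pc | label_result
    # No exception can escape the guarded call, so pc is restored after the if.
    y = pc
    return x, y
```

In this model, y carries the label of o in the first snippet but an empty label in the second, matching the implicit flow described above.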

Declarative security policies. Many of the case studies’
security requirements were expressed using Jif labels.
SIF and the Jif compiler ensure that these labels
(and thus the security requirements) are enforced end-to-end.
In general, Jif’s declarative security policies can relieve
the programmer of enforcing security requirements
programmatically, and give greater assurance that the re- 
quirements are met. This argues for even greater expres- 
siveness in security policies, to allow more application 
security requirements to be captured, and to verify that 
programs enforce these requirements. 


5 Related work 


The most closely related work is Li and 
Zdancewic’s [18], which proposes a security-typed
PHP-like scripting language to address information-flow 
control in web applications. Their system has not been 
implemented. It assumes a strongly-typed database 
interface, and, like SIF, ensures that applications respect 
the confidentiality and integrity policies on data sent 
to and from the database. Their security policies can 
express what information may be downgraded; in con- 
trast, the decentralized label model used in Jif specifies 
who needs to authorize downgrading. In a multi-user 
web application with mutually distrusting users, the 
concept of who a session or process is executing on 
behalf of is crucial to security. We believe that prac- 
tical information-flow control will ultimately need to 
specify multiple aspects of downgrading [29]; extending 
the decentralized label model to reason about other 
downgrading aspects is ongoing work. 

Huang et al. [14], Xie and Aiken [37], and Jovanovic 
et al. [15] all present frameworks for statically analyz- 
ing information flow in PHP web applications. Xie and 
Aiken, and Jovanovic et al. track information integrity 
using a dataflow analysis, while Huang et al. extend 
PHP’s type system with type state. Livshits and Lam [19] 
use a precise static analysis to detect vulnerabilities in 
Java web applications. Each of these frameworks has 
found previously unknown bugs in web applications. Xu 
et al. [38], Halfond and Orso [11] and Nguyen-Tuong 
et al. [25] use dynamic information-flow control to pre- 
vent attacks in web applications. All of these approaches 
use a simple notion of integrity: information is either 
tainted or untainted. While this suffices to detect and 
prevent certain web application vulnerabilities, such as 
SQL injection, it is insufficient for modeling more com- 
plex, application-level integrity requirements that arise 
in applications with multiple mutually distrusting princi- 
pals. Also, they do not address confidentiality informa- 
tion flows, and thus do not control the release of sensitive 
server-side information to web clients. 

Xu et al. [39] propose a framework for analyzing 
and dynamically enforcing client privacy requirements 
in web services. They focus on web service composi- 
tion, assuming that individual services correctly enforce 
policies. Their policies do not appear suitable for rea- 
soning about the security of mutually distrusting users. 
Otherwise, this work is complementary, as we provide 
assurance that web applications enforce security policies.

While there has been much recent work on language- 
based information flow (see [28, 29] for recent surveys), 
comparatively little has focused on creating real systems 
with information flow security, or on languages and tech- 
niques to enable this. No prior work has built real ap- 
plications that enforce both confidentiality and integrity 
policies while dealing securely with their interactions. 

The most realistic prior application experience is that 
of Hicks et al. [13], who use an earlier version of Jif to 
implement a secure CDIS email client, JPmail. Although 
there are similarities between JPmail and the CDIS mail 
application described here, SIF is a more convincing 
demonstration of information flow control in three ways. 
First, SIF is a reusable application framework, not just 
a single application. Second, SIF applications enforce 
integrity, not just confidentiality, and they ensure that de- 
classification is robust [5]. Third, SIF applications can 
dynamically extend the space of principals and labels and 
define their own authentication mechanisms; JPmail re- 
lies on mechanisms for principal management and au- 
thentication that lie outside the scope of the application. 

Askarov and Sabelfeld [2] use Jif to implement crypto- 
graphic protocols for mental poker. They identify several 
useful idioms for (and difficulties with) writing Jif code; 
recent extensions to Jif should assuage many of the diffi- 
culties. 

Praxis High Integrity Systems’ language SPARK [4]
is based on a subset of Ada, and adds information-flow 
analysis. SPARK checks simple dependencies within 
procedures. FlowCaml [27] extends the functional lan- 
guage OCaml with information-flow security types. Like 
SPARK, it does not support features needed for real ap- 
plications: downgrading, dynamic labels, and dynamic 
and application-defined principals. 

Asbestos [9], HiStar [41], and SELinux [20] are operating
systems that track information flow for confidentiality
and integrity. To varying degrees, they provide
flexible security labels and application-defined princi- 
pals. However, these systems are coarse-grained, track- 
ing information flow only between processes. Informa- 
tion flow is controlled only dynamically, which is impre- 
cise, and creates additional information flows from run- 
time label checking. By contrast, Jif checks information 
flow mostly statically, at the granularity of program vari- 
ables, providing increased precision and greater assur- 
ance that a program is secure prior to deployment. As- 
bestos has a web server that allows web applications to 
isolate users’ data from one another, using one process 
per user. All downgrades are performed by trusted pro- 
cesses. Unlike Jif, this granularity of information flow 
tracking does not permit different security policies for 
different data owned by a single user. 

Tse and Zdancewic [35] present a monadic type system
for reasoning about dynamic principals, and certificates
for authority delegation and downgrading. Jif 3.0’s
dependent type system for dynamic labels and princi- 
pals allows similar reasoning. Tse and Zdancewic as- 
sume that certificates are contained in the external en- 
vironment, and do not provide a mechanism to dynam- 
ically create them. Closures in Jif 3.0 can be dynam- 
ically authorized, and may perform arbitrary computa- 
tion, whereas Tse and Zdancewic’s certificates permit 
only authority delegation and downgrading. 

Swamy et al. [32] consider dynamic policy updates, 
and introduce a transactional mechanism to prevent un- 
intentional transitive flows that may arise from policy up- 
dates. In Jif, policies are updated dynamically by adding 
and removing principal delegations, and unintentional 
transitive flows may occur. Their techniques are com- 
plementary to our work, and should be applicable to Jif 
to stop these flows. 


6 Conclusion 


We have designed and implemented Servlet Informa- 
tion Flow (SIF), a novel framework for building high- 
assurance web applications. Extending the Java Servlet 
framework, SIF addresses trust issues in web applica- 
tions, moving trust out of web applications and into SIF 
and the Jif compiler. 

SIF web applications are written entirely in the Jif 3.0 
programming language. At compile time, applications 
are checked to see if they respect the confidentiality and 
integrity of information held on the server: confiden- 
tial information is not released inappropriately to clients, 
and low-integrity information from clients is not used in 
high-integrity contexts. SIF tracks information flow both 
within the handling of a single request, and over multiple 
requests—it closes the loop of information flow between 
client and server. 

Jif 3.0 extends Jif in several ways to make web appli- 
cations possible. It adds sophisticated dynamic mecha- 
nisms for access control, authentication, delegation, and 
principal management, and shows how to integrate these 
features securely with language-based, largely static, 
information-flow control. 

We have used SIF to implement two applications with 
interesting information security requirements. These 
web applications are among the first to statically enforce 
strong and expressive confidentiality and integrity poli- 
cies. Many of the applications’ security requirements 
were expressible as security labels, and are thus enforced 
by the Jif 3.0 compiler. 

As language-based information-flow control becomes 
more mature, and information-flow tools become more 
useful and robust, we expect the task of writing and un- 
derstanding programs with information-flow control to 
become easier. This work makes an important step towards
wider use of information-flow control by providing
a framework in which useful applications can be designed,
implemented, and deployed. The Jif 3.0 compiler
and run-time system and the SIF framework are all pub- 
licly available. 
Abstract


We propose new techniques to combat the problem of 
click fraud in pay-per-click (PPC) systems. Rather than 
adopting the common approach of filtering out seem- 
ingly fraudulent clicks, we consider instead an affirma- 
tive approach that only accepts legitimate clicks, namely 
those validated through client authentication. Our sys- 
tem supports a new advertising model in which “pre- 
mium” validated clicks assume higher value than ordi- 
nary clicks of more uncertain authenticity. Click valida- 
tion in our system relies upon sites sharing evidence of 
the legitimacy of users (distinguishing them from bots, 
scripts, or fraudsters). As cross-site user tracking raises 
privacy concerns among many users, we propose ways 
to make the process of authentication anonymous. Our 
premium-click scheme is transparent to users. It requires 
no client-side changes and imposes minimal overhead on 
participating Web sites. 


Key words: authentication, click-fraud 


1 Introduction 

Pay-per-click (PPC) metering is a popular payment 
model for advertising on the Internet. The model in- 
volves an advertiser who contracts with a specialized en- 
tity, which we refer to as a syndicator, to distribute tex- 
tual or graphical banner advertisements to publishers of 
content. These banner ads point to the advertiser’s Web 
site: When a user clicks on the banner ad on the pub- 
lisher’s webpage, she is directed to the site to which it 
points. Search engines such as Google and Yahoo are 
the most popular syndicators, and create the largest por- 
tion of pay-per-click traffic on the Internet today. These 
sites display advertisements on their own search pages in 
response to the search terms entered by users and charge 
advertisers for clicks on these links (thereby acting as 
their own publishers) or, increasingly, outsource adver- 
tisements to third-party publishers. Advertisers pay syn- 
dicators per referral, and the syndicators pass on a por- 
tion of the payments to the publishers. 


A syndicator or publisher’s server observes a “click” 
simply as a browser request for a URL associated with 
a particular ad. The server has no way to determine if 
a human initiated the action—and, if a human was in- 
volved, whether she acted knowingly and with honest 
intent. Syndicators typically seek to filter fraudulent or 
spurious clicks based on information such as the type of 
advertisement that was requested, the cost of the associ- 
ated keyword, the IP address of the request and the recent 
number of requests from this address. In this paper, we 
propose an alternative approach. Rather than seeking to 
detect and eliminate fraudulent clicks, i.e., filtering out 
seemingly bad clicks, we consider ways of authenticat- 
ing valid clicks, i.e., admitting only verifiably good ones. 
We refer to such validated clicks as premium clicks. 

Our scheme involves a new entity, referred to as 
an attestor, that provides cryptographic credentials for 
clients that perform qualifying actions, such as pur- 
chases. These credentials allow the syndicator to distin- 
guish premium clicks—corresponding to relatively low- 
risk clients—from other, general click traffic. Such classi- 
fication of clicks strengthens a syndicator’s heuristic iso- 
lation of fraud risks. 

The premium-click techniques that we describe in this 
paper are complementary to existing, filter-based tools 
for validating clicks: The two approaches can operate
side by side.


Organization. We begin with a problem statement and 
a description of the related work in section 2, followed
by a structural overview of our approach in section 3.
In section 4, we outline our scheme and detail
its technical foundations. We describe a prototype
implementation of our scheme in section 5 and discuss 
user privacy in section 6, proposing several privacy- 
enhancing techniques. We provide a brief security analy- 
sis in section 7, and conclude in section 8. The paper ap- 
pendix describes design choices for premium-click sys- 
tems with multiple attestors. 
2 Problem Overview and Related Work


Click-fraud is a type of abuse that exploits the lack of 
verifiable human engagement in PPC requests in order to 
fabricate ad traffic. It can take a number of forms. One 
virulent, automated type of click fraud involves a client 
that fraudulently simulates a click by means of a script 
or bot—or as the result of infection by a virus or Tro- 
jan. Such malware typically resides on the computer of 
the user from which the click will be generated, but can 
also in principle reside on access points and consumer 
routers [8, 9, 7]. Some click-fraud relies on real clicks, 
whether intentional or not. An example of the former is 
a so-called click-farm, which is a term denoting a group 
of low-wage workers who click for a living; another ex- 
ample involves deceiving or convincing users to click on 
advertisements. An example of an unintentional click is 
one generated by a malicious cursor-following script that 
places the banner right under the mouse cursor [6]. This 
can be done in a very small window to avoid detection. 
When the user clicks, the click would be interpreted as 
a click on the banner, and cause revenue generation to 
the attacker. A related abuse is manifested in an attack 
where publishers manipulate web pages such that hon- 
est visitors inadvertently trigger clicks [4]. This can be 
done for many common PPC schemes, and simply relies 
on the inclusion of a JavaScript component on the pub- 
lisher’s webpage, where the script reads the banner and 
performs a GET request that corresponds to what would be
performed if a user had initiated a click. 


Click fraud can benefit a fraudster in at least three 
known ways: First of all, a fraudster can use click-fraud 
to inflate the revenue of a publisher. Second, a fraudster 
can employ click-fraud to inflate advertising costs for a 
commercial competitor. As advertisers generally specify 
caps on their daily advertising expenses, such fraud is es- 
sentially a denial-of-service attack. Third, a fraudster can 
modify the ranking of advertisements by a combination 
of impressions and clicks. An impression is the viewing 
of the banner, with no click; this causes the ranking of the 
associated advertisement to go down. This can be done 
to benefit the fraudster’s own advertising programs at the cost of those
of competitors, and to manipulate the price paid per click 
for selected keywords. 

Syndicators can in principle derive financial benefit 
from click fraud in the short term, as they receive revenue 
for whatever clicks they deem “valid.” In the long term, 
however, as customers become sensitive to losses, and 
syndicators rely on third-party auditors to lend credibility 
to their operations, click fraud can jeopardize syndicator-advertiser
relationships. Thus syndicators ultimately
have a strong incentive to eliminate fraudulent clicks. 
Today they employ a battery of filters to weed out suspicious
clicks. These filters are trade secrets, as their disclosure
might prompt new forms of fraud [10]. To give
one example, though, it is likely that syndicators use IP 
tracing to determine if an implausible number of clicks 
is originating from a single source. While heuristic fil- 
ters are fairly effective, they are of limited utility against 
sophisticated fraudsters, and subject to degraded perfor- 
mance as fraudsters learn to defeat them. 


3 Structural Overview 


Authentication. Our premium-click scheme is based 
on authentication of requests via cryptographic attesta- 
tions on client behavior. We refer to these attestations as 
coupons. While a coupon could be realized straightfor- 
wardly using traditional third-party cookies, such cook- 
ies are so commonly blocked by consumers that their use 
is often impractical. Our scheme could alternatively in- 
volve traditional first-party cookies dispensed and har- 
vested by a central authority. This architectural ap- 
proach, however, presents limitations that we explain in 
depth in section 4.1. As we explain, we instead focus in 
this paper on the alternative mechanism of cache cookies. 

Our premium-click scheme has two distinctive as- 
pects: 


1. Pedigree: Our scheme relies on designated Web 
sites called attestors to identify and label clients that 
appear to be operated by legitimate users—as op- 
posed to bots or fraudsters. For example, an attestor 
might be a retail Web site that classifies as legiti- 
mate any client that has made at least $50 of pur- 
chases. (Financial commitment here corroborates 
legitimate user behavior.) We refer to such clients, 
the producers of premium clicks in our scheme, as 
premium clients. In a loose sense, we propose the 
creation of an implicit reputation network to com- 
bat click-fraud, much like the seller reputation on 
eBay [3]. 


2. Traffic caps: Our scheme supports validation of 
clicks from clients that have not produced an ex- 
cessive degree of click-traffic and thereby indicated 
possible malicious activity. In our approach, a 
click is only regarded as valid if accompanied by 
a coupon. Thus, we can detect multiple requests 
from the same origin by keeping track of coupon 
presentation. In the standard approach in which at- 
testations like coupons do not play a role, detection 
of same-source traffic is more challenging, and of- 
ten depends upon coarser origination data, such as 
IP addresses, or more fragile markers of continuity, 
such as session identifiers. 
Architecture. In a traditional scheme, as a user clicks
on a banner placed on the site of a publisher, the cor- 
responding advertisement is downloaded from the ad- 
vertiser and the transaction recorded by the syndicator. 
Later, the syndicator bills the advertiser and pays the 
publisher. 


Under the model of premium clicks, there are addi- 
tional tasks carried out: As a user performs a qualified 
action (such as a purchase), the corresponding attestation 
is embedded in his browser by an attestor. This attesta- 
tion is released to the syndicator when the user clicks 
on a banner. The release can be initiated either by the 
syndicator or the advertiser. (Our prototype relies on 
syndicator triggering of coupon release.) The syndica- 
tor can pay attestors for their participation in a number 
of ways, ranging from a flat fee per time period to a pay- 
ment that depends on the number of associated attesta- 
tions that were recorded in a time interval. To avoid a 
situation where dishonest attestors issue a larger number
of attestations than the protocol prescribes (which would 
increase the earnings of the dishonest attestors), it is pos- 
sible to appeal to standard auditing techniques. 


Challenges. Our approach to premium clicks gives rise 
to two technical challenges. First, we must securely vali- 
date premium clients and their associated clicks. In other 
words, we must ensure that adversaries cannot imperson- 
ate premium clients or forge premium clicks. For this 
purpose, we apply basic cryptographic tools for data in- 
tegrity. Second, we must protect the privacy of clients. 
While we do want syndicators to be able to authenticate 
clients, we do not want syndicators to be able to track 
them, learn their identities, or harvest side information 
about their browsing patterns. Toward this end, we pro- 
pose ways in which coupons may be created as essen- 
tially anonymous credentials. 

Of course, our techniques do not prevent misuse of 
coupons by clients that are “good,” i.e., controlled by 
honest users, and then turn “bad,” e.g., become infected 
with malware. By identifying the sources of clicks, how- 
ever, and making traffic caps more effective, coupons in 
our scheme still offer some protection against fraud even 
in such cases. 

It is important to observe that existing filtering meth- 
ods cannot in general employ cookies/coupons to detect 
fraudulent clicks. That is because filtering is an exclu- 
sionary process: It seeks to identify and eliminate “bad” 
clicks. If a cookie were used to mark and exclude certain 
types of “bad” users, fraudsters could simply remove the 
cookies from their browsers. In contrast, because our 
premium-click scheme is distinguishing, i.e., it only ac- 
cepts “good” clicks, it can benefit from the use of cook- 
ies/coupons. Cookies serve to mark “good” users. 


4 A Premium-Click Scheme 


In a world of perfect transparency, in which a syndica- 
tor knew the (real-world) identity of all users clicking on 
ads, click fraud would be much more manageable. In 
such a world, it would be easier to identify misbehavior 
by a real user—e.g., implausibly many clicks—as well 
as clicks initiated by bogus users or bots. A syndica- 
tor could go further, and reference databases containing 
profiles on the users who clicked on its published ads. 
The syndicator could even create a highly refined pricing 
structure based on a user’s predicted value as a potential 
consumer, with differential compensation for publishers. 
Our premium-click protocol diverges from this ideal in 
two senses: 


• Partial knowledge: Given the fragmented nature
of databases on user behavior and the privacy con- 
cerns attendant on user profiling, our overall pro- 
filing goal is modest. We would like to enable a 
syndicator only to determine that a click originates 
with a true human user with probable honest intent. 
We do not mainly focus on stronger differentiation 
among users, although our protocols could support 
this goal. 


• The browser as carrier: Rather than relying on a
central data repository, we rely on users’ browsers 
to convey information among participating sites. 
This approach helps eliminate engineering com- 
plexity and protect user privacy. 


We design our premium-click scheme to support out- 
sourced PPC advertising. It can equally well secure 
against click fraud when ads are published directly on 
search engines: We need simply treat the syndicator and 
publisher as the same entity. The steps in our scheme are 
as follows and are illustrated in Figure 1. For simplic- 
ity, we assume a single syndicator S and attestor A. (We 
discuss the case of multiple attestors in the appendix.) 


1. Marking: Based on its criteria for user validation, 
the attestor identifies a visiting client as legitimate. 
The attestor then “marks” the client. It does so by 
caching in the client’s browser a coupon γ, a kind
of cryptographic token. 


2. Click / coupon release: When a user clicks on a 
publisher’s advertisement in a browser, the user’s 
browser is directed to a URL on the syndicator’s 
site. This URL includes the publisher’s identity 
ID_pub and the identity of the advertisement that was
clicked, ID_ad. The syndicator then causes the
browser to release its coupon γ simultaneously with
ID_pub and ID_ad.² We let C = (γ, ID_pub, ID_ad)
denote the released triple. We shall henceforth refer
to γ or C alternately as a “coupon,” according to
context.


3. Coupon checking: On receiving a triple C =
(γ, ID_pub, ID_ad), the syndicator checks that γ is a
(cryptographically) well formed coupon, as we describe
in depth later. The syndicator also checks that
the coupon has not been over-used, i.e., that C has
not been submitted an excessive number of times in
the recent past. (What constitutes “excessive” sub- 
mission is a policy decision.) 


4. Reward: If the syndicator successfully verifies that 
C represents a valid premium click, then the syndi- 
cator pays the publisher accordingly. 
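
The over-use check in step 3 can be sketched as a sliding-window counter. The paper leaves the definition of "excessive" as a policy decision, so the cap, the window length, the class name, and the choice not to record rejected submissions are all assumptions of this Python sketch; the cryptographic well-formedness check is not shown here.

```python
from collections import defaultdict, deque


class CouponUsageTracker:
    """Hypothetical sliding-window cap on how often a coupon is submitted."""

    def __init__(self, max_uses, window_seconds):
        self.max_uses = max_uses
        self.window_seconds = window_seconds
        # Maps each coupon value to the timestamps of its accepted submissions.
        self._seen = defaultdict(deque)

    def accept(self, coupon, now):
        """Return True if the coupon is still under its cap at time `now`."""
        times = self._seen[coupon]
        # Forget submissions that have fallen out of the recent window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_uses:
            return False        # over-used: do not treat as a premium click
        times.append(now)
        return True


tracker = CouponUsageTracker(max_uses=2, window_seconds=60)
results = [tracker.accept("gamma", t) for t in (0, 10, 20, 61)]
```

With a cap of two uses per 60 seconds, the third submission at time 20 is rejected, and the coupon becomes usable again once the submission at time 0 has aged out of the window.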


Of course, the publisher might embed additional information
in C, e.g., a timestamp. Moreover, a
user’s browser might in fact contain multiple coupons
γ1, γ2, ... from different attestors, a possibility that we
discuss below. A single computer may have multiple 
users, of course. If they each maintain a separate ac- 
count, then their individual browser instantiations will 
carry user-specific coupons. When users share a browser, 
the browser may carry coupons if at least one of the users 
is validated by an attestor. While validation is not user- 
specific in this case, it is still helpful: A shared machine 
with a valid user is considerably more likely to see honest 
use than one without. 


We now detail the technical foundations of our 
scheme. We assume here that the browser of a given 
user carries at most one coupon. We address the case 
of multiple coupons in the appendix. 


4.1 Coupon caching 


Our first technical design choice is the transport medium 
for coupons. To ensure its correct association with the 
browser that created it, a coupon is best communicated 
as a cached browser value (rather than through a back 
channel). At the same time, it is important to ensure that 
coupons be set such that only the syndicator can retrieve 
them, and fraudsters cannot easily harvest them.

Third-party cookies are the most obvious way to instantiate
coupons. A third-party cookie is one set for a
domain other than the one being visited by the user; thus, 
a coupon could be set as a third-party cookie. Because 
third-party cookies have a history of abusive application, 
however, users regularly block them. First-party cook- 
ies are an alternative mechanism. If an attestor redirects 
users to the site of a syndicator and provides user-specific 
or session-specific information in the redirection, then 
the syndicator can implant a coupon in the form of a
first-party cookie for its own, later use. Redirection of
this kind, however, can be cumbersome, particularly if 
an attestor has relationships with multiple syndicators. 

Cache cookies [5], particularly the TIF-based variety, 
offer an attractive alternative. An attestor can embed a 
coupon in a cache-cookie that is tagged for the site of 
a syndicator, i.e., exclusively readable by the syndica- 
tor. In their ability to be set for third-party sites, cache 
cookies are similar in functionality to third-party cook- 
ies. Cache cookies have a special, useful quirk, though: 
Any Web site visited by a user can cause them to be re- 
leased to the site for which they are tagged. (Thus, as 
we shall see, it is important to authenticate the site initi- 
ating their release from a user’s browser.) Cache cook- 
ies, moreover, function even in browsers where ordinary 
cookies have been blocked. Cache cookies are therefore 
our preferred medium for coupons. 

Briefly, a TIF-based cache cookie works as follows.
Suppose we wish to set a cache cookie bearing value γ
for release to Web site www.S.com. The cache cookie,
then, assumes the form of an HTML page ABC.html that
requests a resource from www.S.com bearing the value
γ. For example, ABC.html might display a GIF image
of the form http://www.S.com/γ.gif. Observe that any
Web site can create ABC.html and plant it in a visiting
user’s browser. Similarly, any Web site that knows
the name of the page/cache-cookie ABC.html can reference
it, causing www.S.com to receive a request for
γ.gif. Only www.S.com, however, can receive the cache
cookie, i.e., the value γ, when it is released from the
browser.
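
As a minimal sketch of the page shape just described, the following builds the body of a cache-cookie page such as ABC.html. The helper name buildCacheCookiePage and the sample value are hypothetical; only the structure (an HTML page that requests an image named after the value from the tagged site) comes from the text.

```javascript
// Sketch: build the body of a TIF-based cache-cookie page such as ABC.html.
// buildCacheCookiePage is a hypothetical helper, not part of any library.
function buildCacheCookiePage(syndicatorHost, gamma) {
  // Encode the value so that it is safe to use as a URL path segment.
  const resource = encodeURIComponent(gamma) + ".gif";
  return (
    "<html><body>" +
    '<img src="http://' + syndicatorHost + "/" + resource + '"/>' +
    "</body></html>"
  );
}

// Example: a page that releases the (made-up) value "a1b2c3" to www.S.com.
const page = buildCacheCookiePage("www.S.com", "a1b2c3");
console.log(page);
```

Any site can serve such a page, but the image request, and hence the value, travels only to the host named in the src attribute.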


4.2 Coupon authentication 


Ensuring against fraudulent creation or use of coupons 
is a key challenge in our scheme. Only attestors should 
be able to construct valid coupons. Coupons must therefore
carry a form of cryptographic authentication. While
digital signatures can in principle offer a flexible way to
authenticate coupons, their computational costs are probably
prohibitive for a high-traffic, potentially multi-site
scheme of the type we propose here. Message-authentication
codes (MACs), a symmetric-key analog of
digital signatures, are a more practical alternative.
Suppose that the attestor A and syndicator S share a
symmetric key k. (This key may be established out of
band or using existing secure channels.) Let MAC_k(m)
represent a strong message authentication code, e.g.,
HMAC [2], computed on a suitably formed message m.
It is infeasible for any third party, e.g., an adversary, to
generate a fresh MAC on any message m. Consequently,
if a coupon assumes the form γ = m || MAC_k(m) for a
bitstring m that is unique to the visit of a client to the site
of an attestor, then the coupon can be copied, but cannot

Figure 1: (1) Alice visits attestor A.com, spends money, and receives coupon value γ. (2) Alice visits publisher
P.com and clicks on an ad. (3) Alice’s browser transmits coupon C = (γ, ID_pub, ID_ad) to syndicator S.com. (4)
S.com pays P.com for Alice’s premium click. (S.com also redirects Alice’s browser to the entity that created the ad.)


be feasibly modified by a third party. The value m might 
be a suitably long (say, 128-bit) random nonce generated 
by A. We propose some privacy-protecting alternative 
formats for m below. 
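
The construction γ = m || MAC_k(m) can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: it uses Node's crypto module, HMAC-SHA256 as the MAC, and hex encoding so the coupon can appear in a URL. The function names createCoupon and verifyCoupon are hypothetical.

```javascript
const crypto = require("crypto");

const NONCE_BYTES = 16; // 128-bit random nonce m, as suggested in the text
const TAG_BYTES = 32;   // HMAC-SHA256 output length (the hash is our assumption)

// Attestor side: gamma = m || MAC_k(m), hex-encoded.
function createCoupon(key) {
  const m = crypto.randomBytes(NONCE_BYTES);
  const tag = crypto.createHmac("sha256", key).update(m).digest();
  return Buffer.concat([m, tag]).toString("hex");
}

// Syndicator side: recompute the MAC over m and compare in constant time.
function verifyCoupon(key, couponHex) {
  const expectedLength = 2 * (NONCE_BYTES + TAG_BYTES);
  if (!/^[0-9a-f]+$/i.test(couponHex) || couponHex.length !== expectedLength) {
    return false; // malformed coupon
  }
  const raw = Buffer.from(couponHex, "hex");
  const m = raw.subarray(0, NONCE_BYTES);
  const tag = raw.subarray(NONCE_BYTES);
  const expected = crypto.createHmac("sha256", key).update(m).digest();
  return crypto.timingSafeEqual(tag, expected);
}

// Example with a freshly generated shared key k.
const k = crypto.randomBytes(32);
const coupon = createCoupon(k);
console.log(verifyCoupon(k, coupon));
```

A coupon copied verbatim still verifies, which is why the freshness checks of Section 4.4 are needed in addition to authentication.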


4.3 Publisher identification/authentication


In addition to ensuring that a coupon is authentic, a syndicator
must also be able to determine what publisher
caused it to be released and is to receive payment for
the associated click. Recall from above that a coupon
takes the form C = (γ, ID_pub, ID_ad), where ID_pub is
the identity of the publisher and ID_ad identifies the
advertisement clicked. In order to create a full coupon, we
must append ID_pub and ID_ad to γ as it is released. To
do so, we can enhance a cache cookie webpage X.html
to include the document referrer, i.e., the tag that identifies
the webpage that causes its release. (In our scheme,
this webpage is a URL on the syndicator, www.S.com,
where both ID_ad and ID_pub are in the URL.) For example,
X.html might take the following form:


<html><body>
<script language="JavaScript">
//Determine referring webpage r
//(which contains ID_ad and ID_pub):
var r = escape(document.referrer);
//Write HTML to release the coupon γ.gif:
document.write('<img src="http://S.com/'
  + 'γ.gif?ref=' + r + '"/>');
</script></body></html>


Now when the syndicator’s site page with a URL
containing ID_pub and ID_ad references X.html, the syndicator
www.S.com receives a request for the resource
γ.gif?ref=www.S.com%3fad%3d(ID_ad)%26pub%3d(ID_pub)
(the value of the ref querystring variable in this resource
request is the referrer, or page that triggered X.html
to load, but encoded so it can appear in the URL).
In essence, it receives a request for an image γ.gif,
and is provided one querystring-style parameter containing
the IDs of the advertisement and publisher.
This string conveys the full desired coupon data
C = (γ, ID_pub, ID_ad).
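
On the syndicator side, recovering C = (γ, ID_pub, ID_ad) from such a request amounts to two rounds of URL parsing. The sketch below assumes the request layout shown above and uses the standard URL class available in Node and in browsers; the function name parseCouponRequest and the sample identifiers are hypothetical.

```javascript
// Sketch: recover C = (gamma, ID_pub, ID_ad) from a coupon-release request.
function parseCouponRequest(requestUrl) {
  try {
    const url = new URL(requestUrl);
    // The path looks like "/<gamma>.gif"; strip the slash and the extension.
    const match = /^\/(.+)\.gif$/.exec(url.pathname);
    const ref = url.searchParams.get("ref"); // percent-decoding is done for us
    if (!match || ref === null) return null;
    // The referrer is itself a URL on the syndicator; it may lack a scheme.
    const refUrl = new URL(ref.includes("://") ? ref : "http://" + ref);
    const idAd = refUrl.searchParams.get("ad");
    const idPub = refUrl.searchParams.get("pub");
    if (idAd === null || idPub === null) return null;
    return { gamma: match[1], idPub: idPub, idAd: idAd };
  } catch (err) {
    return null; // not a parseable URL
  }
}

const parsed = parseCouponRequest(
  "http://www.S.com/abc123.gif?ref=www.S.com%3fad%3d17%26pub%3d42"
);
console.log(parsed);
```

Returning null for anything that does not match the expected shape keeps malformed requests out of the coupon log.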


Remark: In cases where JavaScript is disabled by a
client, an alternative approach is possible. An attestor
can create not one cache cookie, but an array of cache
cookies on independently created values γ_1^(0), ..., γ_k^(0)
and γ_1^(1), ..., γ_k^(1). To encode a k-bit publisher value
ID_pub = b_1 || ... || b_k, the publisher releases the cache
cookies corresponding to γ_1^(b_1), ..., γ_k^(b_k). Of course,
this method is somewhat more cumbersome than use of
document-referrer strings, as it requires the syndicator to
receive and correlate k distinct cache cookies for a single
transaction.
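
The bitwise encoding in this remark can be sketched as follows. The arrays cookies0 and cookies1 stand in for the two families of values (one pair per bit position); the function names and the sample values are hypothetical.

```javascript
// Sketch of the JavaScript-free encoding: two cache cookies per bit position.
// cookies0[i] and cookies1[i] are the values used for bit i being 0 or 1.

// Publisher side: release the cookie that matches each bit of ID_pub.
function selectCookies(idBits, cookies0, cookies1) {
  return idBits.map((bit, i) => (bit === 1 ? cookies1[i] : cookies0[i]));
}

// Syndicator side: map the received values back to bits, in any arrival order.
function decodeId(released, cookies0, cookies1) {
  const bits = [];
  for (let i = 0; i < cookies0.length; i++) {
    if (released.includes(cookies1[i])) bits.push(1);
    else if (released.includes(cookies0[i])) bits.push(0);
    else return null; // a bit position is missing, so the ID is incomplete
  }
  return bits;
}

const zeros = ["a0", "b0", "c0"];
const ones = ["a1", "b1", "c1"];
const released = selectCookies([1, 0, 1], zeros, ones);
console.log(released, decodeId(released, zeros, ones));
```

The decoder makes the correlation cost visible: all k values must arrive before the publisher identity is known.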


4.4 Freshness 


Authentication alone is insufficient to guarantee valid 
coupon use. It is also imperative to confirm that a coupon 
is fresh, that is, that a client is not replaying it more 
rapidly than justified by ordinary use. 

To ensure coupon freshness, a syndicator may maintain
a data structure T = {R^(1), R^(2), ...} recording
coupons received within a recent period of time (as determined
by syndicator policy). A record R^(i) can include
an authentication value γ^(i), publisher identity ID_pub^(i),
ad identifier ID_ad^(i), and a time of coupon receipt t^(i).

When a new coupon C = (γ, ID_pub, ID_ad) is received
at time t, the syndicator can check whether there
exists a C^(i) = (γ, ID_pub, ID_ad) ∈ T with timestamp
t^(i). If t − t^(i) < τ_replay, for some system parameter
τ_replay determined by syndicator policy, then
the syndicator might reject C as a replay. Similarly,
the syndicator can set replay windows for cross-domain
and cross-advertisement clicks. For example, if C^(i) =
(γ, ID_pub, ID_ad^(i)), where ID_ad ≠ ID_ad^(i), i.e., it appears
that a given user has clicked on a different ad on the same
site as that represented by C, the syndicator might implement
a different check t − t^(i) < τ_crossclick to determine
that a coupon is stale and should be rejected. Since a
second click on a given site is more likely representative
of true user intent than a “doubleclick,” we would expect
τ_crossclick < τ_replay.

Of course, many different filtering policies are possible,
as are many different data structures and maintenance
strategies for T.
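
One possible filtering policy is sketched below. T is held as a plain array of records, times are in seconds, and the two window lengths are illustrative values rather than ones taken from the scheme. The sketch covers only the same-publisher cases discussed above and records a coupon only when it is accepted, which is a policy choice of ours.

```javascript
// Sketch of a replay filter over T. Window lengths are illustrative only.
const TAU_REPLAY = 60;     // same coupon, same publisher, same ad
const TAU_CROSSCLICK = 10; // same coupon, same publisher, different ad

// T is an array of records { gamma, idPub, idAd, t }.
function isFresh(T, coupon, t) {
  for (const rec of T) {
    if (rec.gamma !== coupon.gamma || rec.idPub !== coupon.idPub) continue;
    const window = rec.idAd === coupon.idAd ? TAU_REPLAY : TAU_CROSSCLICK;
    if (t - rec.t < window) return false; // seen too recently: reject
  }
  return true;
}

// Check a newly received coupon and record it if it is accepted.
function receive(T, coupon, t) {
  const ok = isFresh(T, coupon, t);
  if (ok) T.push({ ...coupon, t: t });
  return ok;
}

const T = [];
const c = { gamma: "g", idPub: "p", idAd: "a" };
console.log(receive(T, c, 0), receive(T, c, 30));
```

A production system would also expire old records from T and would likely index it by γ rather than scanning linearly.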


5 Prototype Implementation 


We implemented a prototype of our premium-click 
scheme. Four websites at separate IP addresses provide a 
simulated advertiser, publisher, attestor, and syndicator. 
The web sites are served by Apache 2.0.58, and server- 
side scripted with PHP 5.1.6. The database for click, ad, 
and coupon data is MySQL 5.0.26. 


Advertiser. The prototype advertiser consists of two 
fabricated product pages designed as destinations for a 
user that clicks on a web ad. The only other duties of 
an Advertiser in the premium clicks system are to submit 
the ads to the syndicator, and then pay for billed clicks. 


Publisher. The prototype publisher is a simple site that 
embeds ads, served by the syndicator, in iframes. Many 
widespread advertisement schemes (including Google’s 
AdSense) use this technique; others simply write directly 
to a publisher’s page, submitting their advertisements to 
the same origin as the publisher, thus making the scheme 
vulnerable to more click-fraud techniques [4]. 


Attestor. The prototype attestor is a simple service that 
provides a login box. When a user of the service provides 
a valid ID and password, he is provided an internal page 
that serves a cache cookie to the visitor’s browser. This 
simple HTML file is transmitted from the attestor to the 
visitor only once after login. Any subsequent requests 


for the cache-cookie URL are replied to with an HTTP 
304 “not modified” response. This forces the browser to 
use a cached version of the cookie if it exists, and does 
not provide one to browsers lacking a cached version of 
the cookie. 

The cache cookie served by the attestor references an 
image hosted on the syndicator. The URL used to request 
the image is created by JavaScript when the cookie’s 
HTML is rendered, and contains the secret y (which is 
generated when the cache cookie is set) as well as the re- 
ferrer page, i.e., whichever page caused the cache cookie 
to load. Later, when the cookie is loaded in conjunction 
with a click, the URL of the referrer will reveal the ID 
of the publisher and the ID of the advertisement that was 
clicked. 

The attestor needs to create and serve these cache 
cookies when the user logs in, so additional processing is 
required. However, creating a secret value takes very lit- 
tle time, and the cookie can be served in a hidden iframe. 
The result is no difference in experience for the user, and 
only a trivial amount of work for the attestor’s servers. 


Syndicator. Of all the entities, the syndicator does the 
most work. It receives coupons released by the cache 
cookies (in the form of requested images), verifies the 
secrets in the coupons, and records clicks. Additionally, 
each ad click must be directed “through” the syndicator, 
so it must also serve a transfer page to direct the client’s 
web browser to the advertiser’s site. This is an ordinary 
flow of traffic in ad-serving systems that briefly delegates 
control to the syndicator who records clicks. 


• Database. The syndicator hosts a MySQL database
to house the advertisement data (content, advertiser’s
ID, URL), click-through log (each click as
it occurs, including advertisement ID and publisher
ID), as well as a log of received coupons. Since
the released coupon history needs to be saved and
searched, we chose to use a database to ease development.


• Processing Clicks. When an advertisement is
clicked, the client’s browser navigates to the
syndicator’s site, bringing along the advertisement
ID and the publisher ID. For example:
http://syndicator/click.php?ad=x&pub=y
The syndicator then records the click, and responds
with a web page that causes cache cookies from
all attestors to load. (We only implemented one
attestor, so one iframe is rendered with its content
being the attestor’s cache cookie. When there are N
possible attestors, N iframes are used.) Attestors’
cache cookies not available in the browser’s cache
are simply not loaded. Coupons are released by the
client’s browser from any attestor’s cache cookies.
• Receiving Coupons. Coupons are received by the
syndicator in the form of requests for an image
called “coupon.gif”. When this is requested, it is
accompanied by a querystring. For example:
http://syndicator/coupon.gif?secret=y&ref=z
The ref variable in the query string reflects both
the publisher ID ID_pub and the advertisement ID_ad
that was clicked. Receiving this request, the syndicator
records the time, secret y and referrer z in the
database, and then serves a tiny image back to the
client. HTTP headers are provided that force the
client’s browser always to request this image, and
not load it from cache. The purpose of this process
is to ensure the coupons are always freshly delivered,
and not loaded from browser cache.


• Analyzing Clicks. On the syndicator’s click-through
page where the attestors’ cache-cookie iframes are
present, a small delay is forced by JavaScript to
allow the coupons transit time, since they load
asynchronously in iframes. Immediately following
that, the server decides if a click should be
classified as “premium.” This is done by looking
through the coupon database for recently released
coupons corresponding to the advertisement
that was clicked. Time between when the click
was recorded and when the coupons arrived is
noted, and only coupons within a pre-set window
(τ_replay, sixty seconds in our prototype) are considered
in determining the premium status. If coupons
are present and the secrets are valid, the click is
recorded as “premium.”⁴ Otherwise it is recorded
as a general-class click.


In a production system, click analysis would be 
done after redirecting the client by adding it to a 
processing queue. 
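
The classification step can be sketched as a filter over the coupon log. The record fields, the function name classifyClick, and the decision to match on both advertisement and publisher IDs are assumptions made for illustration; the prototype itself was written in PHP against MySQL. The sketch also assumes that a single valid coupon suffices for premium status.

```javascript
// Sketch: classify a recorded click using coupons that arrived shortly after it.
// verifySecret is a caller-supplied predicate, for example a MAC check.
function classifyClick(click, coupons, windowSeconds, verifySecret) {
  const candidates = coupons.filter(
    (c) =>
      c.idAd === click.idAd &&
      c.idPub === click.idPub &&
      c.receivedAt >= click.recordedAt &&
      c.receivedAt - click.recordedAt <= windowSeconds
  );
  // Premium if at least one in-window coupon carries a valid secret.
  return candidates.some((c) => verifySecret(c.secret)) ? "premium" : "general";
}

const click = { idAd: "x", idPub: "y", recordedAt: 100 };
const log = [{ idAd: "x", idPub: "y", secret: "good", receivedAt: 102 }];
console.log(classifyClick(click, log, 60, (s) => s === "good"));
```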


6 Privacy 


In deploying our premium-click scheme with multiple
attestors, A_1, ..., A_q, it would be natural for a syndicator
to share a unique key k_i with each attestor A_i. Given
such independent attestor keys {k_i}, though, a coupon
created by A_i conveys and therefore reveals the fact that
a user has visited the Web site of A_i. Observe, however,
that in our scheme a publisher triggers the release
of a coupon from the browser of a visiting user, but does 
not see the coupon. The syndicator receives the coupon, 
but does not directly interact with the user. In effect, the 
syndicator receives the coupon blindly. While the syndi- 
cator does learn the IP address of the user, this is infor- 
mation that is typically already available: The only ad- 
ditional information that the syndicator learns is whether 


or not the user has received an attestation. Thus, coupons 
naturally decouple information about the browsing pat- 
terns of users from the identities and browsing sessions 
of users. This is an important, privacy-preserving fea- 
ture. 

Such decoupling occurs in the case when ads are outsourced,
that is, when the syndicator and publisher are
separate. When the syndicator and publisher are identical,
i.e., when a search engine displays its own advertisements,
coupons may be linked to users, and therefore
leak potentially sensitive information. A couple of
privacy-enhancing measures are possible. To limit the
amount of leaked browsing information, our scheme
may employ a multiple-coupon technique discussed in
depth in Appendix A. Alternatively, attestors may share
a single key k (or attestors may have overlapping sets of
keys). In this case, a MAC does not reveal the identity
of the attestor that created it. If a coupon γ = m ||
MAC_k(m) is created, as we propose, with a random
nonce m, then it conveys no information about a user’s
identity. In principle, however, it would be possible for
an attestor to embed a user’s identity in m, thereby transmitting
it to the syndicator. This transmission could even
be covert: A ciphertext on a user’s identity, i.e., an encryption
thereof, will have the appearance of a random
string. Proper auditing of the policy and operations of
the attestor or syndicator would presumably be sufficient
in most cases to ensure against collusive privacy infringements
of this kind.

As an alternative, m might be based on distinctive, but
verifiably non-identifying values. For example, m might
include the IP address⁵ and/or timestamp of the client
to which an attestor issues a coupon, perhaps supplemented
by a small counter value.⁶ A client could then
verify that m was properly formatted, and did not encode
the user’s identity. Of course, MAC_k(m) itself might
then embed the user’s identity. It is possible, however, to
eliminate the possibility of a covert channel in the MAC
by periodically refreshing k and publicly revealing old
values.
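
A client-side format check of this kind might look as follows. The text does not fix an encoding for m, so the layout "ip|unixSeconds|counter", the bounds, and the function name isWellFormedM are all assumptions made for the sake of the sketch.

```javascript
// Sketch: client-side check that m has the form "ip|unixSeconds|counter".
// The delimiter, field order, and bounds are assumptions for illustration.
function isWellFormedM(m, clientIp, nowSeconds, maxSkewSeconds = 300, maxCounter = 255) {
  const parts = m.split("|");
  if (parts.length !== 3) return false; // no room for extra fields
  const [ip, ts, counter] = parts;
  if (ip !== clientIp) return false;
  if (!/^\d+$/.test(ts) || !/^\d+$/.test(counter)) return false;
  if (Math.abs(nowSeconds - Number(ts)) > maxSkewSeconds) return false;
  // A small counter bound limits how much data the field could carry covertly.
  return Number(counter) <= maxCounter;
}

console.log(isWellFormedM("192.0.2.7|1000|3", "192.0.2.7", 1010));
```

Once an old key k has been published, the same client could additionally recompute MAC_k(m) to confirm that the tag was honestly derived from m.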


Remarks: 


• There are good business motivations for attestors
not merely to validate, but also to classify users. For
example, a retailer might not merely indicate in
a coupon that a client has spent enough to justify
validation, but also provide a rough indication of
how much the client has spent (“low spender,” “medium
spender,” “profligate”). Advertisers might then be
charged on a differential basis according to the associated,
perceived value of a client. Such classification
would create a new dimension of privacy infringement.
In the outsourcing case, where coupons
are decoupled from user identities, this approach
might meet with user and regulator acceptance. In
the case where the syndicator publishes advertisements,
and coupons are linked to users, privacy is
of greater concern. As advertisers will necessarily
learn the syndicator’s differential pricing scheme,
there will be at least some transparency.


• In principle, public-key digital signatures offer
more flexible privacy protection for coupon origins
than MACs. For example, group signatures,
e.g., [1], permit the identity of a signing attestor
to be hidden in the general case, but revoked by a
trusted entity in the case of system compromise. In
a high-traffic advertising system, however, public-key
cryptography would be prohibitively resource-intensive.


7 Security Analysis 


Without possession of an attestor key, an adversary can- 
not feasibly forge new coupons, thanks to our use of 
MACs. An adversary could still bypass our scheme in 
several ways: 


• Direct publisher fraud: Using a slight modification
of the proposed solution, the publisher could
cause release of coupons even when users do not
click on ads.


• Indirect publisher fraud: A dishonest Web site
could re-direct users to the publisher’s site.


• Malware-driven clicks: A virus or widely spread
Trojan could either surreptitiously direct a user’s
browser to a Web site and simulate a click or else
steal a coupon from the browser for use on another
platform.


All of these attacks are possible in existing click-fraud 
schemes. The various techniques used to address them 
today are equally applicable to premium clicks. For ex- 
ample, a syndicator can direct its own client machines to 
a publisher’s site to determine if the publisher is generat- 
ing fraudulent clicks. Indeed, our premium-click scheme 
makes detection of misbehavior easier, as it permits a 
syndicator to “mark” a client coupon and therefore di- 
rectly monitor the traffic generated by the client and even 
detect the emergence of stolen coupons. 

An adversary can also try to exploit the special char- 
acteristics of our scheme as follows: 


• Posing as an attestor: An adversary might either
establish itself as an attestor or compromise the key
of an existing attestor. If the syndicator sets appropriate
policies for creating attestors, then it should
be difficult for an adversary to pose as one. Attestors
are likely, in any case, to be a more exclusive
class of Web site than publishers or even advertisers.
Moreover, in the case where MAC keys are attestor-specific,
the syndicator can individually monitor the
traffic generated by each attestor, making fraud detection
easier.


• Compromise of an attestor key: An adversary can
attempt to learn the MAC key of an existing attestor.
The difficulty of this form of attack depends on the
security of the attestor’s Web site. MAC keys for
premium clicks may be protected using many of the
same measures employed to secure SSL keys and
other cryptographic secrets.


• Coupon harvesting: An adversary could harvest
coupons from attestors by creating accounts or
clients that meet their validation criteria. By establishing
appropriate policies for validation by its attestors,
the syndicator can attempt to attach a financial
cost to this form of fraud in excess of the gains
that a fraudster might reap from it.


Auditing 


Since the syndicator is ultimately in control over decid- 
ing which clicks should be considered “premium” (and 
earns more when clicks are premium), publishers and 
advertisers may accuse the syndicator of improperly in- 
flating the percentage of clicks considered premium. To 
solve this problem, an additional entity called an auditor 
can be contracted to watch the coupons that are released, 
and verify the premium-status judgement of the syndica- 
tor. The auditor would not be rewarded based on click 
traffic, so it would have no incentive to inflate or deflate 
the number of premium clicks from those that are legiti- 
mate. 

The cache cookies set by attestors can be crafted
so that, when an advertisement’s URL is clicked, the
coupon C = (γ, ID_pub, ID_ad) is released both to the
syndicator and to the auditor who maintains an independent
database. When the syndicator’s numbers are contested,
the coupons recorded by the auditor can be used
to recompute the number of premium clicks for a given
advertisement or publisher, and compared to the syndicator’s
calculation.
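
The recount can be sketched as a simple aggregation over the auditor's records. The record fields, the isPremium predicate, and the function names are hypothetical; in practice the predicate would apply the same MAC and freshness checks that the syndicator claims to apply.

```javascript
// Sketch: compare the syndicator's claimed premium counts with an auditor's recount.
function countPremiumByPublisher(coupons, isPremium) {
  const counts = new Map();
  for (const c of coupons) {
    if (!isPremium(c)) continue;
    counts.set(c.idPub, (counts.get(c.idPub) || 0) + 1);
  }
  return counts;
}

// claimedCounts: Map from publisher ID to the syndicator's reported count.
function findDiscrepancies(claimedCounts, auditorCoupons, isPremium) {
  const recount = countPremiumByPublisher(auditorCoupons, isPremium);
  const out = [];
  for (const [idPub, claimed] of claimedCounts) {
    const audited = recount.get(idPub) || 0;
    if (claimed !== audited) out.push({ idPub: idPub, claimed: claimed, audited: audited });
  }
  return out;
}

const claimed = new Map([["p1", 2], ["p2", 1]]);
const auditorLog = [
  { idPub: "p1", valid: true },
  { idPub: "p1", valid: false },
  { idPub: "p2", valid: true },
];
console.log(findDiscrepancies(claimed, auditorLog, (c) => c.valid));
```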


8 Conclusion 


In contrast to today’s heuristic filtering methods for eliminating
“bad” clicks, our premium-click scheme relies on
a foundation of cryptographic authentication to validate
“good” clicks. Premium clicks are by no means a cure-all
for fraud, and are themselves subject to attack. The
value of premium clicks lies in the way that they provide
new, cryptographically authenticated visibility into click
traffic, and thus a new, stronger platform for combating
click fraud.

While premium clicks could in principle supplant current
filtering schemes entirely, they are attractive in that
they can be deployed in a complementary fashion alongside
existing systems. We have proposed a new advertising
model in which advertisers pay a higher charge for
premium clicks. We believe that such a scheme might
be launched experimentally by a syndicator with minimal
impact on existing business and then expanded as
its success warrants. Thus premium clicks offer
not only a new approach to click fraud, but one with a
practical path to fruition.
Notes


¹ This research was performed by the author at RavenWhite Inc.

² The reason for having the syndicator trigger coupon release is
twofold: (1) to prevent JavaScript-based click automation, ads today
are often rendered inside an iframe whose source is loaded from the
syndicator’s site, and (2) to eliminate the need for JavaScript on the
publisher’s site.

³ An authenticator could create a list X of random codes and transfer
it to the syndicator via a backchannel, but this would not be efficient
(and would also eliminate some of the privacy properties we would
like to achieve).

⁴ In a multi-attestor environment, where clients may carry and release
multiple coupons, the syndicator needs some mechanism to determine
which coupons correspond to a given client. A simple option is
to attach a fresh, random number (nonce) to the links in each rendered
advertisement. The nonce will attach itself to all of the coupons that a
client releases in a given click.

⁵ Inclusion of an IP address in a coupon also has some security benefits.
In the case where the syndicator publishes its own ads, it can check
that a client’s presented IP address is consistent with the IP address in
the coupon, e.g., it originates with the same service provider.

⁶ A counter might still embed a covert channel, but the size of
the channel might be made small enough to alleviate the problem of
privacy infringement significantly.


A Multiple Attestors 


User privacy in our premium-click scheme depends upon
how the value γ is formed, and on the number and content
of the coupons cached in a user’s browser. Let us
now therefore consider a system with multiple attestors,
A_1, ..., A_q. Each attestor A_i shares a key k_i with
the syndicator. We now describe the technical challenges
that arise with multiple attestors.


Multiple coupons. The first problem we encounter in
a system with multiple attestors is the difficulty of managing
multiple cache cookies across different domains.
A cache-cookie system can involve caching of a set of
j different webpages X_1, X_2, ..., X_j in a given user’s
browser, each webpage serving as a slot for a distinct
cache cookie. Two difficulties arise, however. The first
is that a site seeking to release a set of cache cookies (i.e.,
the publisher) cannot determine what slots in a user’s
browser actually contain cache cookies. The only way
for the publisher to release all cache cookies is to call
all j webpages. The second is that a site seeking to set a
cache cookie, i.e., an attestor, cannot determine if a given
slot has been filled. If the attestor plants a cache cookie
in a slot that already contains one, the previously planted
cache cookie will be effaced.

The simplest way to circumvent these difficulties in
our premium-clicks scheme is to manage only a single
slot, that is, to maintain only a single cache cookie in a
given user’s browser. Only the cache cookie planted most
recently by an attestor will then persist. Provided that the
syndicator regards all attestors as having equal authority
in validating users, this approach does not result in any
service degradation.

If, however, the syndicator desires the ability to harvest
multiple coupons, then attestors must use multiple
slots. One possible approach is to maintain an individual
slot for each attestor, i.e., to let j = q. If the number
of attestors is small, this may be workable. Alternatively,
attestors may plant coupons in random slots,
sometimes supplanting previous coupons, or subsets of
attestors may share slots. The syndicator might, for example,
assign different weight to attestors, according to
the anticipated reliability of their attestations; attestors
with the same rating might share a slot.


Keying. One approach to management of attestor keys
is to assign an identical key k to all attestors, i.e., let
k_1 = k_2 = ... = k. While this approach has the merit of
simplicity, it has the disadvantage of rendering tracing
and key-revocation difficult.

It is preferable, therefore, to create attestor keys {k_i} in
an independent manner. In this case, a coupon γ = m ||
MAC_{k_i}(m) is cryptographically bound to the attestor
that created it. That is, only attestor A_i, with its knowledge
of k_i, can feasibly create γ of this form. To enable
the syndicator to determine the correct key for verification
of the MAC, the coupon must be supplemented
with i, the identity of the attestor. For example, we
might let m = i || r, where r is a random nonce.
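
Verification with per-attestor keys then begins with a key lookup on i. The sketch below assumes a one-byte attestor index, a 16-byte nonce r, and HMAC-SHA256; these sizes, and the function names, are illustrative choices rather than part of the scheme.

```javascript
const crypto = require("crypto");

// Sketch: m = i || r with a one-byte attestor index i (an assumed encoding).
function createCouponFor(attestorId, key) {
  const m = Buffer.concat([Buffer.from([attestorId]), crypto.randomBytes(16)]);
  const tag = crypto.createHmac("sha256", key).update(m).digest();
  return Buffer.concat([m, tag]); // gamma = m || MAC_{k_i}(m)
}

// keys maps attestor index i to the key k_i held by the syndicator.
// Returns the attestor index on success, or null if the coupon is rejected.
function verifyWithKeyLookup(keys, coupon) {
  if (coupon.length !== 1 + 16 + 32) return null;
  const attestorId = coupon[0];
  const key = keys.get(attestorId);
  if (!key) return null; // unknown or revoked attestor
  const m = coupon.subarray(0, 17);
  const expected = crypto.createHmac("sha256", key).update(m).digest();
  return crypto.timingSafeEqual(coupon.subarray(17), expected) ? attestorId : null;
}
```

Revoking an attestor then amounts to deleting its entry from the key table, which is the tracing and revocation benefit that a single shared key lacks.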
Abstract


This paper explores the use of execution-based Web 
content analysis to protect users from Internet-borne 
malware. Many anti-malware tools use signatures to 
identify malware infections on a user’s PC. In contrast, 
our approach is to render and observe active Web con- 
tent in a disposable virtual machine before it reaches the 
user’s browser, identifying and blocking pages whose be- 
havior is suspicious. Execution-based analysis can de- 
fend against undiscovered threats and zero-day attacks. 
However, our approach faces challenges, such as achiev- 
ing good interactive performance, and limitations, such 
as defending against malicious Web content that contains 
non-determinism. 

To evaluate the potential for our execution-based 
technique, we designed, implemented, and measured 
a new proxy-based anti-malware tool called SpyProxy. 
SpyProxy intercepts and evaluates Web content in tran- 
sit from Web servers to the browser. We present the 
architecture and design of our SpyProxy prototype, fo- 
cusing in particular on the optimizations we developed 
to make on-the-fly execution-based analysis practical. 
We demonstrate that with careful attention to design, an 
execution-based proxy such as ours can be effective at 
detecting and blocking many of today’s attacks while 
adding only small amounts of latency to the browsing ex- 
perience. Our evaluation shows that SpyProxy detected 
every malware threat to which it was exposed, while 
adding only 600 milliseconds of latency to the start of 
page rendering for typical content. 


1 Introduction 


Web content is undergoing a significant transforma- 
tion. Early Web pages contained simple, passive con- 
tent, while modern Web pages are increasingly active, 
containing embedded code such as ActiveX components, 
JavaScript, or Flash that executes in the user’s browser. 
Active content enables a new class of highly interactive 
applications, such as integrated satellite photo/mapping 


systems. Unfortunately, it also leads to new secu- 
rity threats, such as “drive-by-downloads” that exploit 
browser flaws to install malware on the user’s PC. 


This paper explores a new execution-based approach 
to combating Web-borne malware. In this approach we 
render and execute Web content in a disposable, iso- 
lated execution environment before it reaches the user’s 
browser. By observing the side-effects of the execution, 
we can detect malicious behavior in advance in a safe 
environment. This technique has significant advantages: 
because it is based on behavior rather than signatures, 
it can detect threats that have not been seen previously 
(e.g., zero-day attacks). However, it raises several cru- 
cial questions as well. First, can execution-based analy- 
sis successfully detect today’s malware threats? Second, 
can the analysis be performed without harming browser 
responsiveness? Third, what are the limitations of this 
approach, in particular in the face of complex, adversar- 
ial scripts that contain non-determinism? 


Our goal is to demonstrate the potential for execution- 
based tools that protect users from malicious content as 
they browse the Web. To do this, we designed, proto- 
typed, and evaluated a new anti-malware service called 
SpyProxy. SpyProxy is implemented as an extended Web 
proxy: it intercepts users’ Web requests, downloads con- 
tent on their behalf, and evaluates its safety before re- 
turning it to the users. If the content is unsafe, the proxy 
blocks it, shielding users from the threat. Our intention 
is not to replace other anti-malware tools, but to add a 
new weapon to the user’s arsenal; SpyProxy is comple- 
mentary to existing anti-malware solutions. 


SpyProxy combines two key techniques. First, it ex- 
ecutes Web content on-the-fly in a disposable virtual 
machine, identifying and blocking malware before it 
reaches the user’s browser. In contrast, many existing 
tools attempt to remove malware after it is already in- 
stalled. Second, it monitors the executing Web content 
by looking for suspicious “trigger” events (such as reg- 
istry writes or process creation) that indicate potentially 
malicious activity [28]. Our analysis is therefore based 
on behavior rather than signatures. 
SpyProxy can in principle function either as a service
deployed in the network infrastructure or as a client-side 
protection tool. While each has its merits, we focus in 
this paper on the network service, because it is more 
challenging to construct efficiently. In particular, we de- 
scribe a set of performance optimizations that are neces- 
sary to meet our goals. 

In experiments with clients fetching malicious Web 
content, SpyProxy detected every threat, some of which 
were missed by other anti-spyware systems. Our eval- 
uation shows that with careful implementation, the per- 
formance impact of an execution-based malware detector 
can be reduced to the point where it has negligible effect 
on a user’s browsing experience. Despite the use of a 
“heavyweight” Internet proxy and virtual machine tech- 
niques for content checking, we introduce an average de- 
lay of only 600 milliseconds to the start of rendering in 
the client browser. This is small considering the amount 
of work performed and relative to the many seconds re- 
quired to fully render a page. 

The remainder of the paper proceeds as follows. Sec- 
tion 2 presents the architecture and implementation of 
SpyProxy, our prototype proxy-based malware defense 
system. Section 3 describes the performance optimiza- 
tions that we used to achieve acceptable latency. In 
Section 4 we evaluate the effectiveness and performance 
of our SpyProxy prototype. Section 5 discusses related 
work and we conclude in Section 6. 


2 Architecture and Implementation 


This section describes SpyProxy—an execution- 
based proxy system that protects clients from malicious, 
active Web objects. We begin our discussion by plac- 
ing SpyProxy in the context of existing malware defenses 
and outlining a set of design goals. We next describe the 
architecture of SpyProxy and the main challenges and 
limitations of our approach. 


2.1 Defending Against Modern Web Threats 


Over the past several years, attackers have routinely 
exploited vulnerabilities in today’s Web browsers to in- 
fect users with malicious code such as spyware. Our 
crawler-based study of Web content in October 2005 
found that a surprisingly large fraction of pages con- 
tained drive-by-download attacks [28]. A drive-by- 
download attack installs spyware when a user simply vis- 
its a malicious Web page. 

Many defenses have been built to address this prob- 
lem, but none are perfect. For example, many users in- 
stall commercial anti-spyware or anti-virus tools, which 
are typically signature-based. Many of these tools look 
only for malware that is already installed, attempting the 
difficult operation of removing it after the fact. Firewall- 
based network detectors can filter out some well-known 


and popular attacks, but they typically rely on static scan- 
ning to detect exploits, limiting their effectiveness. They 
also require deployment of hardware devices at organi- 
zational boundaries, excluding the majority of household 
users. Alternatively, users can examine blacklists or pub- 
lic warning services such as SiteAdvisor [41] or Stop- 
Badware [43] before visiting a Web site, but this can be 
less reliable [5, 44]. 

None of these defenses can stop zero-day attacks 
based on previously unseen threats. Furthermore, sig- 
nature databases struggle to keep up with the rising 
number of malware variants [9]. As a result, many of 
today’s signature-based tools fail to protect users ade- 
quately from malicious code on the Web. 


2.2 Design Goals 


SpyProxy is a new defense tool that is designed for to- 
day’s Web threats. It strives to keep Web browsing con- 
venient while providing on-the-fly protection from ma- 
licious Web content, including zero-day attacks. Our 
SpyProxy architecture has three high-level goals: 


1. Safety. SpyProxy should protect clients from 
harm by preventing malicious content from reach- 
ing client browsers. 


2. Responsiveness. The use of SpyProxy should not 
impair the interactive feel and responsiveness of the 
user’s browsing experience. 


3. Transparency. The existence and operation of 
SpyProxy should be relatively invisible and com- 
patible with existing content-delivery infrastructure 
(both browsers and servers). 


Providing safety while maintaining responsiveness is 
challenging. To achieve both, SpyProxy uses several 
content analysis techniques and performance-enhancing 
optimizations that we next describe. 


2.3 Proxy-based Architecture 


Figure 1 shows the architecture of a simplified ver- 
sion of SpyProxy. Key components include the client 
browser, SpyProxy, and remote Web servers. When the 
client browser issues a new request to a Web server, the 
request first flows through SpyProxy where it is checked 
for safety. 

When a user requests a Web page, the browser soft- 
ware generates an HTTP request that SpyProxy must in- 
tercept. Proxies typically use one of two methods for 
this: browser configuration (specifying an HTTP proxy) 
or network-level forwarding that transparently redirects 
HTTP requests to a proxy. Our prototype system cur- 
rently relies on manual browser configuration. 





28 


16th USENIX Security Symposium 


USENIX Association 


[Figure 1 diagram, panels (a) to (c): the client browser, SpyProxy 
(proxy front end, Squid Web cache, VM worker), and the Web.] 

Figure 1: SpyProxy architecture. (a) A client browser re- 
quests a Web page; the proxy front end intercepts the request, 
retrieves the root page, and statically analyzes it for safety. (b) 
If the root page cannot be declared safe statically, the front end 
forwards the URL to a VM worker. A browser in the VM down- 
loads and renders the page content. All HTTP transfers flow 
through the proxy front end and a Squid cache. (c) If the page 
is safe, the VM notifies the front end, and the page content is 
released to the client browser from the Squid cache. Note that 
if the page has been cached and was previously determined to 
be safe, the front end forwards it directly to the client. 


The SpyProxy front end module receives clients’ 
HTTP requests and coordinates their processing, as 
shown in Figure 1(a). First, it fetches the root page us- 
ing a cache module (we use Squid in our prototype). 
If the cache misses, it fetches the data from the Web, 
caching it if possible and then returning it to the front 
end. Second, the front end statically analyzes the page 
(described below) to determine whether it is safe. If safe, 
the proxy front end releases the root page content to the 
client browser, and the client downloads and renders it 
and any associated embedded objects. 


If the page cannot be declared safe statically, the front 
end sends the page’s URL to a virtual machine (VM) 
worker for dynamic analysis (Figure 1(b)). The worker 
directs a browser running in its VM to fetch the requested 
URL, ultimately causing it to generate a set of HTTP 
requests for the root page and any embedded objects. 
We configure the VM’s browser to route these requests 


first through the front end and then through the locally 
running Squid Web cache. Routing it through the front 
end facilitates optimizations that we will describe in Sec- 
tion 3. Routing the request through Squid lets us reduce 
interactions with the remote Web server. 

The browser in the VM worker retrieves and renders 
the full Web page, including the root page and all embed- 
ded content. Once the full page has been rendered, the 
VM worker informs the front end as to whether it has de- 
tected suspicious activity; this is done by observing the 
behavior of the page during rendering, as described be- 
low. If so, the front end notifies the browser that the page 
is unsafe. If not, the front end releases the main Web 
page to the client browser, which subsequently fetches 
and downloads any embedded objects (Figure 1(c)). 
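
The request flow just described can be summarized in a short sketch. It is illustrative only: `fetch_via_cache`, `is_statically_safe`, and `render_in_vm` are hypothetical callables standing in for the Squid-backed fetch, the static analyzer, and the VM worker, and none of these names come from the prototype.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    safe: bool
    reason: str


def handle_request(
    url: str,
    fetch_via_cache: Callable[[str], bytes],
    is_statically_safe: Callable[[bytes], bool],
    render_in_vm: Callable[[str], bool],
) -> Verdict:
    """Decide whether the root page at `url` may be released to the client."""
    root_page = fetch_via_cache(url)       # step 1: fetch through the cache
    if is_statically_safe(root_page):      # step 2: cheap static check
        return Verdict(True, "static")
    # Step 3: dynamic analysis. render_in_vm (assumed) returns True
    # if any trigger fired while the VM browser rendered the page.
    if render_in_vm(url):
        return Verdict(False, "trigger fired")
    return Verdict(True, "dynamic")
```

The expensive VM check runs only when the static check cannot vouch for the page, which is the ordering the text relies on for performance.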


2.3.1 Static Analysis of Web Content 


On receiving content from the Internet, the SpyProxy 
front end first performs a rudimentary form of static anal- 
ysis, as previously noted. The goal of static analysis is 
simple: if we can verify that a page is safe, we can pass 
it directly to the client without a sophisticated and costly 
VM-based check. If static analysis were our only check- 
ing technique, our analysis tool would need to be com- 
plex and complete. However, static analysis is just a per- 
formance optimization. Content that can be analyzed and 
determined to be safe is passed directly to the client; con- 
tent that cannot is passed to a VM worker for additional 
processing. 

Our static analyzer is conservative. If it cannot iden- 
tify or process an object, it declares it to be potentially 
unsafe and submits it to a VM worker for examination. 
For example, our analyzer currently handles normal and 
chunked content encodings, but not compressed content. 
Future improvements to the analyzer could reduce the 
number of pages forwarded to the VM worker and there- 
fore increase performance. 

When the analyzer examines a Web page, it tries to 
determine whether the page is active or passive. Ac- 
tive pages include executable content, such as ActiveX, 
JavaScript, and other code; passive pages contain no such 
interpreted or executable code. Pages that contain active 
content must be analyzed dynamically. 

It is possible for seemingly passive content to com- 
promise the user’s system if the renderer has security 
holes. Such flaws have occurred in the past in both the 
JPEG and PNG image libraries. For this reason, we con- 
sider any non-HTML content types to be unsafe and send 
them for dynamic processing. In principle, a browser’s 
HTML processor could have vulnerabilities in it as well; 
it is possible to configure SpyProxy to disable all static 
checking if this is a concern. 
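
A conservative classifier along these lines might look as follows. This is a minimal sketch, not the prototype's analyzer: the tag list in `ACTIVE_TAGS` and the treatment of `on*` attributes and `javascript:` URLs are our assumptions about what counts as active content.

```python
from html.parser import HTMLParser

# Assumed list of tags that introduce executable content.
ACTIVE_TAGS = {"script", "object", "embed", "applet", "iframe"}


class _ActiveContentFinder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.active = False

    def handle_starttag(self, tag, attrs):
        if tag in ACTIVE_TAGS:
            self.active = True
        for name, value in attrs:
            # Inline event handlers and javascript: URLs also count as active.
            is_js_url = (value or "").strip().lower().startswith("javascript:")
            if name.startswith("on") or is_js_url:
                self.active = True


def is_statically_safe(content_type: str, content_encoding: str, body: str) -> bool:
    """Return True only for passive HTML; anything else goes to the VM worker."""
    if content_type.split(";")[0].strip().lower() != "text/html":
        return False  # non-HTML content types are treated as unsafe
    if content_encoding.lower() not in ("", "identity"):
        return False  # compressed content is not handled statically
    finder = _ActiveContentFinder()
    finder.feed(body)
    finder.close()
    return not finder.active
```

Every uncertain case returns `False`, so an error in the classifier costs an extra VM check rather than a missed attack.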

We validated the potential benefits of static checking 
with a small measurement study, where we collected a 







17-hour trace of Web requests generated by the user pop- 
ulation in our department. We saw that 54.8% of HTML 
pages transferred contain passive content. Thus, there 
can be significant benefit in identifying these pages and 
avoiding our VM-based check for them. 


2.3.2 Execution-based Analysis through VM-based 
Page Rendering 


A drive-by download attack occurs when a Web page 
exploits a flaw in the victim’s browser. In the worst 
case, an attack permits the attacker to install and run arbi- 
trary software on the victim’s computer. Our execution- 
based approach to detecting such attacks is adapted from 
a technique we developed in our earlier spyware mea- 
surement study [28], where we used virtual machines to 
determine whether a Web page had malicious content. 
We summarize this technique here. 

Our detection method relies on the assumption that 
malicious Web content will attempt to break out of the 
security sandbox implemented by the browser. For ex- 
ample, the simple act of rendering a Web page should 
never cause any of the following side-effects: the cre- 
ation of a new process other than known helper appli- 
cations, modifications to the file system outside of safe 
folders such as the browser cache, registry modifications, 
browser or OS crashes, and so on. If we can determine 
that a Web page triggers any of these unacceptable con- 
ditions, we have proof that the Web page contains mali- 
cious content. 

To analyze a Web page, we use a “clean” 
VMware [45] virtual machine configured with unneces- 
sary services disabled. We direct an unmodified browser 
running in the VM to fetch and render the Web page. Be- 
cause we disabled other services, any side effects we ob- 
serve must be caused by the browser rendering the Web 
page. We monitor the guest OS and browser through 
“triggers” installed to look for sandbox violations, in- 
cluding those listed above. If a trigger fires, we declare 
the Web page to be unsafe. This mechanism is described 
in more detail in [28]. 
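
The trigger conditions listed above can be expressed as a predicate over events observed in the guest. The sketch below is hypothetical: the `Event` record, the event kinds, and the allow-lists `KNOWN_HELPERS` and `SAFE_FOLDERS` are illustrative assumptions, not the monitoring interface of the prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "process_create", "file_write", "registry_write", or "crash"
    target: str  # process name, file path, or registry key


# Hypothetical allow-lists; real values depend on the guest configuration.
KNOWN_HELPERS = {"acrord32.exe"}
SAFE_FOLDERS = ("c:\\browser_cache\\",)


def fires_trigger(event: Event) -> bool:
    """Return True if this event is a sandbox violation."""
    target = event.target.lower()
    if event.kind == "process_create":
        return target not in KNOWN_HELPERS
    if event.kind == "file_write":
        return not target.startswith(SAFE_FOLDERS)
    return event.kind in ("registry_write", "crash")


def page_is_unsafe(events: list[Event]) -> bool:
    """A page is unsafe as soon as any observed event fires a trigger."""
    return any(fires_trigger(e) for e in events)
```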

Note that this technique is behavior-based rather than 
signature-based. We do not attempt to characterize vul- 
nerabilities; instead, we execute or render content to 
look for evidence of malicious side-effects. Accordingly, 
given a sufficiently comprehensive set of trigger condi- 
tions, we can detect zero-day attacks that exploit vulner- 
abilities that have not yet been identified. 


2.4 Limitations 


Our approach is effective, but has a number of chal- 
lenges and limitations. First, the overhead of cloning a 
VM, rendering content within it, and detecting trigger 
conditions is potentially high. In Section 3 we describe 
several optimizations to eliminate or mask this overhead, 


and we evaluate the success of these optimizations in 
Section 4. Second, our trigger monitoring system should 
be located outside the VM rather than inside it, to prevent 
it from being tampered with or disabled by the malware 
it is attempting to detect. Though we have not done so, 
we believe we could modify our implementation to use 
techniques such as VM introspection [18] to accomplish 
this. Third, pre-executing Web content on-the-fly raises 
several correctness and completeness issues, which we 
discuss below. 


2.4.1 Non-determinism 


With SpyProxy in place, Web content is rendered 
twice, once in the VM’s sandboxed environment and 
once on the client. For our technique to work, all at- 
tacks must be observed by the VM: the client must never 
observe an attack that the VM-based execution missed. 
This will be true if the Web content is deterministic and 
follows the same execution path in both environments. In 
this way, SpyProxy is ideal for deterministic Web pages 
that are designed to be downloaded and displayed to the 
user as information. 

However, highly interactive Web pages resemble 
general-purpose programs whose execution paths depend 
on non-deterministic factors such as randomness, time, 
unique system properties, or user input. An attacker 
could use non-determinism to evade detection. For ex- 
ample, a malicious script could flip a coin to decide 
whether to carry out an attack; this simple scheme would 
defeat SpyProxy 50% of the time. 
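
To make the arithmetic concrete, assume (as a simplification of ours) that the script attacks with probability p, independently on each rendering. The page escapes detection whenever the VM rendering happens to be benign, with probability 1 - p, and a given client is actually attacked only if its own rendering then attacks, with probability p(1 - p). For a fair coin this gives 1/2 and 1/4 respectively:

```python
from fractions import Fraction


def evasion_odds(p: Fraction) -> tuple[Fraction, Fraction]:
    """p is the assumed chance that one rendering launches the attack."""
    undetected = 1 - p            # the VM rendering sees no attack
    client_hit = undetected * p   # ...and the client rendering then attacks
    return undetected, client_hit


print(evasion_odds(Fraction(1, 2)))  # (Fraction(1, 2), Fraction(1, 4))
```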

As a more pertinent example, if a Web site relies on 
JavaScript to control ad banner rotation, it is possible that 
the VM worker will see a benign ad while the client will 
see a malicious ad. Note, however, that much of Inter- 
net advertising today is served from ad networks such 
as DoubleClick or Advertising.com. In these systems, 
a Web page makes an image request to the server, and 
any non-determinism in picking an ad happens on the 
server side. In this case, SpyProxy will return the same 
ad to both the VM worker and the client. In general, 
only client-side non-determinism could cause problems 
for SpyProxy. 

There are some potential solutions for handling non- 
determinism in SpyProxy. Similar to ReVirt [12], we 
could log non-deterministic events in the VM and re- 
play them on the client; this likely would require exten- 
sive browser modifications. We could rewrite the page to 
make it deterministic, although a precise method for do- 
ing this is an open problem, and is unlikely to generalize 
across content types. The results of VM-based rendering 
can be shipped directly to the client using a remote dis- 
play protocol, avoiding client-side rendering altogether, 
but this would break the integration between the user’s 
browser and the rest of their computing environment. 

None of these approaches seem simple or satisfactory; 
as a result, we consider malicious non-determinism to be 
a fundamental limitation to our approach. In our proto- 
type, we did not attempt to solve the non-determinism 
problem, but rather we evaluated its practical impact 
on SpyProxy’s effectiveness. Our results in Section 4 
demonstrate that our system detected all malicious Web 
pages that it examined, despite the fact that the major- 
ity of them contained non-determinism. We recognize 
that in the future, however, an adversary could intro- 
duce non-determinism in an attempt to evade detection 
by SpyProxy. 


2.4.2 Termination 


Our technique requires that the Web page rendering 
process terminates so that we can decide whether to 
block content or forward it to the user. SpyProxy uses 
browser interfaces to determine when a Web page has 
been fully rendered. Unfortunately, for some scripts ter- 
mination depends on timer mechanisms or user input, 
and in general, determining when or whether a program 
will terminate is not possible. 

To prevent “timebomb-based” attacks, we speed up 
the virtual time in the VM [28]. If the rendering times 
out, SpyProxy pessimistically assumes the page has 
caused the browser to hang and considers it unsafe. Post- 
rendering events, such as those that fire because of user 
input, are not currently handled by SpyProxy, but could 
be supported with additional implementation. For exam- 
ple, we could keep the VM worker active after rendering 
and intercept the events triggered because of user input to 
forward them to the VM for pre-checking. The interposi- 
tion could be accomplished by inserting run-time checks 
similar to BrowserShield [33]. 
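
The pessimistic timeout policy can be sketched as below. The `render` callable is a hypothetical stand-in for rendering the page in the VM and reporting whether a trigger fired; a real system would also discard the hung VM, which this sketch does not model.

```python
import concurrent.futures as cf


def verdict_with_timeout(render, timeout_s):
    """Return True if the page may be released, False if it must be blocked.

    `render` (assumed) renders the page in the VM and returns True if any
    trigger fired.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(render)
    try:
        trigger_fired = future.result(timeout=timeout_s)
    except cf.TimeoutError:
        return False  # rendering did not finish in time: assume unsafe
    finally:
        pool.shutdown(wait=False)
    return not trigger_fired
```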


2.4.3 Differences Between the Proxy and Client 


In theory, the execution environment in the VM and 
on the client should be identical, so that Web page ren- 
dering follows the same execution path and produces the 
same side-effects in both executions. Differing environ- 
ments might lead to false positives or false negatives. 

In practice, malware usually targets a broad audience 
and small differences between the two environments are 
not likely to matter. For our system, it is sufficient that 
harmful side-effects produced at the client are a subset 
of harmful side-effects produced in the VM. This im- 
plies that the VM system can be partially patched, which 
makes it applicable for all clients with a higher patch 
level. Currently, SpyProxy uses unpatched Windows 
XP VMs with an unpatched IE browser. As a result, 
SpyProxy is conservative and will block a threat even if 
the client is patched to defend against it. 

There is a possibility that a patch could contain a bug, 
causing a patched client to be vulnerable to an attack to 


which the unpatched SpyProxy is immune [24]. We as- 
sume this is a rare occurrence, and do not attempt to de- 
fend against it. 


2.5 Client-side vs. Network Deployment 


As we hinted before, SpyProxy has a flexible imple- 
mentation: it can be deployed in the network infrastruc- 
ture, or it can serve as a client-side proxy. There are 
many tradeoffs involved in picking one or the other. For 
example, a network deployment lets clients benefit from 
the workloads of other clients through caching of both 
data and analysis results. On the other hand, a client-side 
approach would remove the bottleneck of a centralized 
service and the latency of an extra network hop. How- 
ever, clients would be responsible for running virtualiza- 
tion software that is necessary to support SpyProxy’s VM 
workers. Many challenges, such as latency optimizations 
or non-determinism issues, apply in both scenarios. 

While designing our prototype and carrying out our 
evaluation, we decided to focus on the network-based 
SpyProxy. In terms of effectiveness, the two approaches 
are identical, but obtaining good performance with a net- 
work deployment presents more challenges. 


3 Performance Optimizations 


The simple proxy architecture described in Section 2 
will detect and block malicious Web content effectively, 
but it will perform poorly. For a given Web page request, 
the client browser will not receive or render any content 
until the proxy has downloaded the full page from the re- 
mote Web server, rendered it in a VM worker, and satis- 
fied itself that no triggers have fired. Accordingly, many 
of the optimizations that Web browsers perform to mini- 
mize perceived latency, such as pipelining the transfer of 
embedded objects and the rendering of elements within 
the main Web page, cannot occur. 

To mitigate the cost of VM-based checking in our 
proxy, we implemented a set of performance optimiza- 
tions that either enable the browser to perform its normal 
optimizations or eliminate proxy overhead altogether. 


3.1 Caching the Result of Page Checks 


Web page popularity is known to follow a Zipf dis- 
tribution [6]. Thus, a significant fraction of requests 
generated by a user population are repeated requests for 
the same Web pages. Web proxy caches take advantage 
of this fact to reduce Web traffic and improve response 
times [1, 13, 15, 21, 52, 53]. Web caching studies gener- 
ally report hit rates as high as 50%. 

Given this, our first optimization is caching the result 
of our security check so that repeated visits to the same 
page incur the overhead of our VM-based approach only 
once. In principle, the hit rate in our security check cache 
should be similar to that of Web caches. 

This basic idea faces complications. The principle 
of complete mediation warns against caching security 
checks, since changes to the underlying security pol- 
icy or resources could lead to caching an incorrect out- 
come [34]. In our case, if any component in a Web page 
is dynamically generated, then different clients may be 
exposed to different content. However, in our architec- 
ture, our use of the Squid proxy ensures that no confu- 
sion can occur: we cache the result of a security check 
only for objects that Squid also caches, and we invalidate 
pages from the security cache if any of the page’s objects 
is invalid in the Squid cache. Thus, we generate a hit 
in the security cache only if all of the Web page content 
will be served out of the Squid proxy cache. Caching 
checks for non-deterministic pages is dangerous, and we 
take the simple step of disabling the security cache for 
such pages. 
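
The consistency rule between the security cache and the Web cache can be sketched as follows. This is a minimal illustration under our own assumptions: `web_cache_has` is a hypothetical callable that reports whether the Web cache still holds a valid copy of an object, and the `deterministic` flag is assumed to be supplied by the caller.

```python
class SecurityCache:
    """Remembers 'safe' verdicts, valid only while every object stays cached."""

    def __init__(self, web_cache_has):
        self._web_cache_has = web_cache_has  # callable: object URL -> bool
        self._safe_pages = {}                # page URL -> set of object URLs

    def record_safe(self, page_url, object_urls, deterministic=True):
        objects = set(object_urls) | {page_url}
        # Cache a verdict only for deterministic pages whose objects are all cached.
        if deterministic and all(self._web_cache_has(u) for u in objects):
            self._safe_pages[page_url] = objects

    def is_known_safe(self, page_url):
        objects = self._safe_pages.get(page_url)
        if objects is None:
            return False
        if not all(self._web_cache_has(u) for u in objects):
            del self._safe_pages[page_url]   # an object expired: invalidate
            return False
        return True
```

A hit therefore implies that the client will be served exactly the bytes that were checked.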


3.2 Prefetching Content to the Client 


In the unoptimized system shown in Figure 1, the 
Web client will not receive any content until the entire 
Web page has been downloaded, rendered, and checked 
by SpyProxy. As a result, the network between the 
client and the proxy remains idle when the page is be- 
ing checked. If the client has a low-bandwidth network 
connection, or if the Web page contains large objects, 
this idle time represents a wasted opportunity to begin 
the long process of downloading content to the client. 

To rectify this, SpyProxy contains additional com- 
ponents and protocols that overlap several of the steps 
shown in Figure 1. In particular, a new client-side com- 
ponent acts as a SpyProxy agent. The client-side agent 
both prefetches content from SpyProxy and releases it 
to the client browser once SpyProxy informs it that the 
Web page is safe. This improves performance by trans- 
mitting content to the client in parallel with checking the 
Web page in SpyProxy. Because we do not give any Web 
page content to the browser before the full page has been 
checked, this optimization does not erode security. 

In our prototype, we implemented the client-side 
agent as an IE plugin. The plugin communicates with 
the SpyProxy front end, spooling Web page content and 
storing it until SpyProxy grants it authorization to release 
the content to the browser. 
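
The spool-then-release behavior of the client-side agent can be illustrated with a small sketch. The class and method names are hypothetical and do not reflect the plugin's actual interface.

```python
class PrefetchAgent:
    """Buffers content pushed by the proxy until a verdict arrives."""

    def __init__(self):
        self._spool = {}  # page URL -> bytearray of content received so far

    def on_data(self, url, chunk: bytes):
        # Store prefetched bytes without showing them to the browser.
        self._spool.setdefault(url, bytearray()).extend(chunk)

    def on_verdict(self, url, safe: bool):
        """Return the spooled content if safe; otherwise discard it."""
        content = self._spool.pop(url, bytearray())
        return bytes(content) if safe else None
```

Because the browser sees nothing until `on_verdict` is called with `safe=True`, overlapping the transfer with the check does not weaken the safety guarantee.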


3.3 The Staged Release of Content 


Although prefetching allows content to be spooled 
to the client while SpyProxy is performing its security 
check, the user’s browser cannot begin rendering any of 
that content until the full Web page has been rendered 
and checked in the VM worker. This degrades responsiveness, 
since the client browser cannot take advantage 
of its performance optimizations that render content well 
before the full page has arrived. 

Figure 2: Staged release optimization. The progression of 
events in the VM worker’s browser shows how staged release 
operates on a Web page with two embedded objects. As embedded 
objects become fully downloaded and rendered by the 
VM worker’s browser, more of the Web page is released to the 
client-side browser. 


We therefore implemented a “staged release” opti- 
mization. The goal of staged release is to present con- 
tent considered safe for rendering to the client browser 
in pieces; as soon as the proxy believes that a slice of 
content (e.g., an object or portion of an HTML page) is 
safe, it simultaneously releases and begins transmitting 
that content to the client. 

Figure 2 depicts the process of staged release. A page 
consists of a root page (typically containing HTML) and 
a set of embedded objects referred to from within the root 
page. As a Web browser downloads and renders more 
and more of the root page, it learns about embedded ob- 
jects and begins downloading and rendering them. 


Without staged release, the proxy releases no content 
until the full Web page and its embedded objects have 
been rendered in the VM. With staged release, once the 
VM has rendered an embedded object, it releases that 
object and all of the root page content that precedes its 
reference. If the client browser evaluates the root page in 
the same order as the VM browser, this is safe to do; our 
results in Section 4 confirm this optimization is safe in 
practice. Thus, a pipeline is established in which content 
is transmitted and released incrementally to the client 
browser. 
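
The release rule can be stated compactly: the releasable part of the root page is everything that precedes the latest reference whose object the VM has already rendered. The sketch below is an illustration under our own assumptions; `refs` is a hypothetical list of (byte offset of the reference in the root page, object URL) pairs.

```python
def releasable_prefix(root_len, refs, rendered, page_done=False):
    """Return how many bytes of the root page may be released to the client.

    refs:      list of (offset_of_reference, object_url) pairs (assumed input)
    rendered:  set of object URLs the VM browser has fully rendered
    page_done: True once the whole page has been rendered without a trigger
    """
    if page_done:
        return root_len
    # Release all root-page content preceding the furthest rendered reference.
    return max((offset for offset, url in refs if url in rendered), default=0)
```

For a 400-byte root page with references at offsets 100 and 250, rendering the first object releases 100 bytes, rendering the second releases 250, and the remainder follows once the full page has been checked.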

In Figures 2(a) and 2(b), only part of the main Web 
page has been downloaded and rendered by the VM 
browser. In Figure 2(c), all of the first embedded object 
has been rendered by the VM, which causes that object 
and some of the main Web page content (shown in black) 
to be released and transmitted to the client browser. More 
of the main Web page and the second embedded object 
is downloaded and rendered in Figure 2(d), until finally, 
in Figures 2(e) and 2(f), the full Web page is released. 

Many Web pages contain dozens of embedded im- 
ages. For example, CNN’s Web page contains over 32 
embedded objects. Faced with such a Web page, our 
staged release optimization quickly starts feeding the 
client browser more and more of the root page and as- 
sociated embedded objects. As a result, the user does not 
observe expensive Web access delay. 

Note that staged release is independent from prefetch- 
ing. With prefetching, content is pushed to the client-side 
agent before SpyProxy releases it to the client browser; 
however, no content is released until the full page is 
checked. With staged release, content is released incre- 
mentally, but released content is not prefetched. Staged 
release can be combined with prefetching, but since it 
does not require a client-side agent to function, it may 
be advantageous to implement staged release without 
prefetching. We evaluate each of these optimizations in- 
dependently and in combination in Section 4. 


3.4 Additional Optimizations 


SpyProxy contains a few additional optimizations. 
First, the VM worker is configured to have a browser 
process already running inside, ready to accept a URL 
to retrieve. This avoids any start-up time associated with 
booting the guest OS or launching the browser. Sec- 
ond, the virtual disk backing the VM worker is stored 
in a RAM-disk file system in the host OS, eliminating 
the disk traffic associated with storing cookies or files 
in the VM browser. Finally, instead of cloning a new 
VM worker for every client request, we re-use VM work- 
ers across requests, garbage collecting them only after a 
trigger fires or a configurable number of requests has oc- 
curred. Currently, we garbage collect a worker after 50 
requests. 
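
The worker re-use policy can be sketched as follows. The names are hypothetical: `clone_vm` stands in for whatever creates a fresh VM worker, and a worker is modeled as a callable that renders a URL and returns True if a trigger fired.

```python
class WorkerPool:
    """Re-uses one VM worker until a trigger fires or a request limit is hit."""

    def __init__(self, clone_vm, max_requests=50):
        self._clone_vm = clone_vm          # assumed factory for a clean worker
        self._max_requests = max_requests
        self._worker = clone_vm()
        self._served = 0

    def check(self, url):
        """Render `url` in the current worker; return True if a trigger fired."""
        fired = self._worker(url)
        self._served += 1
        if fired or self._served >= self._max_requests:
            self._worker = self._clone_vm()  # discard and replace the worker
            self._served = 0
        return fired
```

Replacing the worker immediately after a trigger fires ensures that a compromised guest never checks a later request.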


4 Evaluation 


This section evaluates the effectiveness and performance 
of our SpyProxy architecture and prototype. The 
prototype includes the performance optimizations we described 
previously. Our results address three key questions: 
how effective is our system at detecting and blocking 
malicious Web content, how well do our performance 
optimizations mask latency from the user, and how well 
does our system perform given a realistic workload? 

malicious pages visited                         100 
    browser exploits                             27 
# sites containing the malicious pages           45 
malicious pages blocked by SpyProxy             100% 
malicious domains identified by SiteAdvisor      80% 
malicious pages containing non-determinism       96% 

Table 1: Effectiveness of SpyProxy. The effectiveness of 
SpyProxy at detecting and blocking malicious Web content. 
SpyProxy was successful at detecting and blocking 100% of 
the malicious Web pages we visited, in spite of the fact that 
most of them contained non-determinism. In comparison, the 
SiteAdvisor service incorrectly classified 20% of the malicious 
Web domains as benign. 


4.1 Effectiveness at Blocking Malicious Code 


We first consider the ability of SpyProxy to success- 
fully block malicious content. To quantify this, we man- 
ually gathered a list of 100 malicious Web pages on 45 
distinct sites. Each of these pages performs an attack of 
some kind. We found these pages using a combination of 
techniques, including: (1) searching Google for popular 
Web categories such as music or games, (2) mining pub- 
lic blacklists of known attack sites, and (3) examining 
public warning services such as SiteAdvisor. 

Some of the Web pages we found exploit browser vul- 
nerabilities to install spyware. Others try to “push” ma- 
licious software at clients spontaneously, requiring user 
consent to install it; we have configured SpyProxy to 
automatically accept such prompts to evaluate its effec- 
tiveness at blocking these threats. The pages include a 
diversity of attack methods, such as the WMF exploit, 
ActiveX controls, applet-based attacks, JavaScript, and 
pop-up windows. A successful attack inundates the vic- 
tim with adware, dialer, and Trojan downloader software. 

Table 1 quantifies the effectiveness of our system. 
SpyProxy detected and blocked 100% of the attack 
pages, despite the diversity of attack methods to which 
it was exposed. Further, most of these attack pages con- 
tained some form of non-deterministic content; in prac- 
tice, none of the attacks we found attempted to evade 
detection by “hiding” inside non-deterministic code. 

The table also shows the advantage of our on-the-fly 
approach compared to a system like SiteAdvisor, which 
provides static recommendations based on historical ev- 
idence. SiteAdvisor misclassified 20% of the malicious 
sites as benign. While we cannot explain why SiteAdvisor 
failed on these sites, we suspect it is due to a combination 
of incomplete Web coverage (i.e., not having examined 
some pages) and stale information (i.e., a page 
that was benign when examined has since become ma- 
licious). SpyProxy’s on-the-fly approach examines Web 
page content as it flows towards the user, resulting in a 
more complete and effective defense. 

For an interesting example of how SpyProxy works in 
practice, consider www.crackz.ws, one of our 100 mali- 
cious pages. This page contains a specially crafted im- 
age that exploits a vulnerability in the Windows graph- 
ics rendering engine. The exploit runs code that silently 
downloads and installs a variety of malware, including 
several Trojan downloaders. Many signature-based anti- 
malware tools would not prevent this attack from suc- 
ceeding; they would instead attempt to remove the mal- 
ware after the exploit installs it. 

In contrast, when SpyProxy renders a page from 
www.crackz.ws in a VM, it detects the exploit when the 
page starts performing unacceptable activity. In this case, 
as the image is rendered in the browser, SpyProxy de- 
tects an unauthorized creation of ten helper processes. 
SpyProxy subsequently blocks the page before the client 
renders it. Note that SpyProxy does not need to know 
any details of the exploit to stop it. Equally important, 
in spite of the fact that the exploit attacks a non-browser 
flaw that is buried deep in the software stack, SpyProxy’s 
behavior-based detection allowed it to discover and pre- 
vent the attack. 


4.2 Performance of the Unoptimized System 


This section measures the performance of the basic 
unoptimized SpyProxy architecture we described in Sec- 
tion 2.3. These measurements highlight the limitations 
of the basic approach; namely, unoptimized SpyProxy 
interferes with the normal browser rendering pipeline by 
delaying transmission until an entire page is rendered and 
checked. They also suggest opportunities for optimiza- 
tion and provide a baseline for evaluating the effective- 
ness of those optimizations. 

We ran a series of controlled measurements, testing 
SpyProxy under twelve configurations that varied across 
the following three dimensions: 


• Proxy configuration. We compared a regular
browser configured to communicate directly with
Web servers with a browser that routes its requests
through the SpyProxy checker.


• Client-side network. We compared a browser
running behind an emulated broadband connection
with a browser running on the same gigabit Ethernet
LAN as SpyProxy. We used the client-side NetLimiter
tool and capped the upload and download client
bandwidth at 1.5 Mb/s to emulate the broadband
connection.


• Web page requested. We measured three different
Web pages: the Google home page, the front page
of the New York Times, and the “MSN shopping
insider” blog, which contains several large, embedded
images. The Google page is small: just 3,166 bytes
of HTML and a single 8,558 byte embedded GIF.
The New York Times front page is larger and more
complex: 92KB of HTML, 74 embedded images, 4
stylesheets, 3 XML objects, 1 Flash animation, and
10 embedded JavaScript objects. This represents
844KB of data. The MSN blog consists of a 79KB
root HTML page, 18 embedded images (the largest
of which is 176KB), 2 stylesheets, and 1 embedded
JavaScript object, for a total of 1.4MB of data.


(a) broadband

                      Google                        NY Times                      MSN blog
                      render begins  render ends    render begins  render ends    render begins  render ends
direct                0.21s          0.64s          0.41s          4.8s           0.40s          10.2s
unoptimized SpyProxy  0.79s          1.2s           3.4s           7.3s           2.7s           12.4s

(b) gigabit

                      Google                        NY Times                      MSN blog
                      render begins  render ends    render begins  render ends    render begins  render ends
direct                0.20s          0.63s          0.41s          3.3s           0.36s          2.3s
unoptimized SpyProxy  0.79s          1.2s           3.4s           5.3s           2.7s           3.9s


Table 2: Performance of the unoptimized SpyProxy. These
tables compare the latency of an unprotected browser that
downloads content directly from Web servers to that of a
protected browser downloading through the SpyProxy service. We
show the latency until the page begins to render on the client
and the latency until the page finishes rendering. The data are
shown for three Web pages, with the client on (a) an emulated
broadband access link, and (b) the same LAN as SpyProxy.


For each of the twelve configurations, we created 
a timeline showing the latency of each step from the 
client’s Web page request to the final page rendering in 
the client. We broke the end-to-end latency into several 
components, including WAN transfer delays, the over- 
head of rendering content in the VM worker before re- 
leasing it to the client, and internal communication over- 
head in the SpyProxy system itself. We cleared all caches 
in the system to ensure that content was retrieved from 
the original Web servers in all cases. 

For each configuration, Table 2 shows the time until
content first begins to render on the user’s screen and the
time until the Web page finishes rendering. In all cases,
the unoptimized SpyProxy implementation added less
than three seconds to the total page download time.
However, the time until rendering began was much higher on
the unoptimized system, growing in some cases by a
factor of ten. This confirms that our system can perform
well, but, without optimizations, it interferes with the
browser’s ability to reduce perceived latency by
pipelining the transfer and rendering of content.


time (ms)  event

     0     user requests URL, browser generates HTTP request
   169     SpyProxy FE receives request, requests root page from Squid
   538     SpyProxy FE finishes static check, forwards URL to VM
   560     VM browser generates HTTP request
   561     first byte of root page arrives at VM browser
  3055     last byte of last page component arrives at VM browser
  3363     VM browser finishes rendering, checking triggers
  3374     first byte of root page arrives at client browser
  7334     last byte of last page component arrives at client browser
  7347     client browser finishes rendering content

client browser transfer and render time:  4.5s
overhead introduced by VM browser:        2.8s
other SpyProxy system overhead:           0.05s


Table 3: Detailed breakdown of the unoptimized SpyProxy.
Events occurring when fetching the New York Times page over
broadband through SpyProxy. Most SpyProxy overhead is due
to serializing the VM browser download and trigger checks
before transferring or releasing content to the client browser.

Table 3 provides a more detailed timeline of events
when fetching the New York Times page from a broadband
client using the unoptimized SpyProxy. Downloading
and rendering the page in the VM browser introduced
2.8 seconds of overhead. Since no data flows to the client
browser until SpyProxy finishes rendering and checking
content, this VM rendering latency translates directly
into delay experienced by the user.
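
As a worked check of the 2.8-second figure, the short sketch below recomputes the VM browser overhead from the Table 3 timestamps. The dictionary keys are illustrative names chosen here, not identifiers from the SpyProxy implementation.

```python
# Timestamps in milliseconds, copied from Table 3
# (New York Times page, broadband client, unoptimized SpyProxy).
timeline_ms = {
    "vm_request_generated": 560,
    "vm_render_and_checks_done": 3363,
    "client_first_byte": 3374,
}

# Time spent downloading, rendering, and checking the page inside the VM.
vm_overhead_ms = (
    timeline_ms["vm_render_and_checks_done"] - timeline_ms["vm_request_generated"]
)

# Share of the wait before the client's first byte that the VM phase accounts for.
vm_share = vm_overhead_ms / timeline_ms["client_first_byte"]

print(f"VM browser overhead: {vm_overhead_ms / 1000:.1f} s")
print(f"share of wait before first client byte: {vm_share:.0%}")
```

Under these numbers the VM phase accounts for most of the time that passes before the client browser receives anything.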


4.3 Performance Optimizations 


To reduce the overhead introduced by the unoptimized 
SpyProxy system, we previously described three opti- 
mization techniques: prefetching content to a client-side 
agent, the staged release of content to the client browser, 
and caching the results of security checks. We now 
present the results of a set of microbenchmarks that eval- 
uate the impact of each optimization. 

Figure 3 summarizes the benchmark results. Both graphs
show the latency to download three different pages
to a client on the emulated broadband connection. For
each page, we show latency for five cases: (1) the
unoptimized SpyProxy, (2) SpyProxy with only prefetching
enabled, (3) SpyProxy with only staged release enabled,
(4) SpyProxy with a hit in the enabled security cache, and
(5) the base case of a client fetching content directly from
Web servers. Figure 3(a) shows the latency before page
rendering begins in the client browser, while Figure 3(b)
shows the latency until page rendering ends.


[Figure 3: bar charts (a) and (b). x-axis: Google, New York Times, MSN Blog. y-axis in (a): latency to start rendering (ms). Legend: unoptimized, prefetching only, staged release only, cache hit only, direct.]


Figure 3: Performance of optimizations (broadband). The
latency until the client browser (a) begins rendering the page,
and (b) finishes rendering the page. Each graph shows the
latency for three different pages for five configurations.


                      Google                        NY Times                      MSN blog
                      render begins  render ends    render begins  render ends    render begins  render ends
unoptimized SpyProxy  0.79s          1.21s          3.37s          7.3s           2.7s           12.4s
prefetching only      0.78s          1.15s          3.43s          5.2s           2.2s           11.3s
                      (-0.01s)       (-0.06s)       (+0.06s)       (-2.1s)        (-0.5s)        (-1.1s)


Table 4: Prefetching (broadband). Latency improvements
gained by the prefetching optimization in the broadband
environment. Prefetching alone did not yield significant benefits.


In combination, the optimizations serve to reduce the
latency before the start of rendering in the client. With
all of the optimizations in place, the page load “feels”
nearly as responsive through SpyProxy as it does without
SpyProxy. In either case, the page begins rendering
about a second after the request is generated. The
optimizations did somewhat improve the total rendering
latency relative to the unoptimized implementation
(Figure 3(b)), but this was not nearly as dramatic. Page
completion time is dominated by transfer time over the
broadband network, and our optimizations do nothing to
reduce this.


                      Google                        NY Times                      MSN blog
                      render begins  render ends    render begins  render ends    render begins  render ends
unoptimized SpyProxy  0.79s          1.21s          3.37s          7.3s           2.7s           12.4s
staged release only   0.64s          1.13s          0.92s          5.2s           1.3s           11.3s
                      (-0.15s)       (-0.08s)       (-2.45s)       (-2.1s)        (-1.4s)        (-1.1s)


Table 5: Staged release (broadband). Latency improvements
from staged release in the broadband environment. Staged
release significantly improved the latency until rendering starts.
It yielded improvements similar to prefetching in the latency
until full page rendering ends.


4.3.1 Prefetching


Prefetching by itself does not yield significant
benefits. As shown in Table 4, it did not reduce
rendering start-time latency. With prefetching alone, the client
browser effectively stalls while the VM browser
downloads and renders the page fully in the proxy. That is,
SpyProxy does not release content to the client’s browser
until the VM-based check ends.

However, we did observe some improvement in
finish-time measurements. For example, the time to fully
render the New York Times page dropped by 2.1 seconds,
from 7.3 seconds in the unoptimized SpyProxy to 5.2
seconds with prefetching enabled. Prefetching
successfully overlaps some transmission of content to the
client-side agent with SpyProxy’s security check, slightly
lowering overall page load time.


4.3.2 Staged Release


Staged release very successfully reduced initial
latency before rendering started; this time period has the
largest impact on perceived responsiveness. As shown
in Table 5, staged release reduced this latency by
several seconds for both the New York Times and MSN blog
pages. In fact, from the perspective of a user, the New
York Times page began rendering nearly four times more
quickly with staged release enabled. For all three pages,
initial rendering latency was near the one-second mark,
implying good responsiveness.

The staged release optimization also reduced the
latency of rendering the full Web page to nearly the same
point as prefetching. Even though content does not start
flowing to the client until it is released, this
optimization releases some content quickly, causing an overlap of
transmission with checking that is similar to prefetching.

Staged release outperforms prefetching in the case
that matters: initial time to rendering. It also has the
advantage of not requiring a client-side agent. Once
SpyProxy decides to release content, it can simply begin
uploading it directly to the client browser. Prefetching
requires the installation of a client-side software
component, and it provides benefits above staged release only
in a narrow set of circumstances (namely, pages that
contain very large embedded objects).


                      Google                        NY Times                      MSN blog
                      render begins  render ends    render begins  render ends    render begins  render ends
unoptimized SpyProxy  0.79s          1.21s          3.37s          7.3s           2.7s           12.4s
security cache hit    0.23s          0.64s          0.71s          4.6s           0.5s           10.2s
                      (-0.56s)       (-0.57s)       (-2.66s)       (-2.7s)        (-2.2s)        (-2.2s)


Table 6: Security cache hit (broadband). This table shows
the latency improvements gained when the security cache
optimization is enabled and the Web page hits in the cache.

To better visualize the impact of staged release, Fig- 
ure 4 depicts the sequence of Web object completion 
events that occur during the download and rendering of a 
page. Figure 4(a) shows completion events for the New 
York Times page. The unoptimized SpyProxy (top) does 
not transmit or release events to the client browser until 
the full page has rendered in the VM. With staged re- 
lease (4(a) bottom), as objects are rendered and checked 
by the SpyProxy VM, they are released and transmitted 
to the client browser and then rendered. Accordingly, the 
sequence of completion events is pipelined between the 
two browsers. This leads to much more responsive ren- 
dering and an overall lower page load time. 

Figure 4(b) shows a similar set of events for the MSN
blog page. Since this page consists of a few large embedded
images, the dominant cost in both the unoptimized
and staged-release-enabled SpyProxy implementations is
the time to transmit the images to the client over
broadband. Accordingly, though staged release permits the
client browser to begin rendering more quickly, most
objects queue up for transmission over the broadband link
after being released by SpyProxy.
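
The difference between the unoptimized system and staged release in Section 4.3.2 can be sketched as a toy model. The check-completion times below are made up, the function name is hypothetical, and the model ignores dependencies between objects and the broadband transfer time; it only illustrates when content first becomes available to the client.

```python
def release_times(check_done_ms, staged):
    """Map each object's check-completion time to its release time.

    Unoptimized: nothing is released until the last check completes.
    Staged: each object is released as soon as its own check completes.
    """
    if staged:
        return list(check_done_ms)
    last = max(check_done_ms)
    return [last for _ in check_done_ms]


# Hypothetical check-completion times (ms) for four page objects.
checks = [600, 900, 1800, 3300]

first_unoptimized = min(release_times(checks, staged=False))
first_staged = min(release_times(checks, staged=True))
print(first_unoptimized, first_staged)
```

In this model the final object is released at the same moment either way, which mirrors the observation that staged release helps the start of rendering far more than total page load time.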


4.3.3 Caching 


When a client retrieves a Web page using the opti- 
mized SpyProxy, both the outcome of the security check 
and the page’s content are cached in the proxy. When 
a subsequent request arrives for the same page, if any 
of its components are cached and still valid, our system 
avoids communicating with the origin Web server. In ad- 
dition, if all components of the page are cached and still 
valid, the system uses the previous security check results 
instead of incurring the cost of a VM-based evaluation. 

In Table 6, we show the latency improvement of
hitting in the security cache compared with the unoptimized
SpyProxy. As with the other optimizations, the primary
benefit of the security cache is to improve the latency
until the page begins rendering. Though the full page load
time improves slightly, the transfer time over the
broadband connection still dominates. However, on a security
cache hit, the caching optimization is extremely
effective, since it eliminates the need to evaluate content in a
VM.


Figure 4: Timeline of events with staged release (broadband). The sequence of object rendering completion events that occur
over time for (a) the New York Times, and (b) MSN blog pages. The top figures show the sequence of events for the unoptimized
SpyProxy, while the bottom figures show what happens with staged release. In each figure, the top series of dots represents
completions in the client browser and the bottom series in SpyProxy’s VM browser. Staged release is effective at the early release
of objects to the browser.
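
The reuse rule from Section 4.3.3 (use the previous verdict only if every component of the page is cached and still valid) could look roughly like the sketch below. The `SecurityCache` class, its methods, and the per-component expiry times are assumptions for illustration; the text does not describe the actual data structures.

```python
import time


class SecurityCache:
    """Toy cache of per-page check results; names and layout are hypothetical."""

    def __init__(self):
        # url -> (verdict, list of expiry timestamps, one per page component)
        self._entries = {}

    def store(self, url, is_safe, component_expiries):
        self._entries[url] = (is_safe, list(component_expiries))

    def lookup(self, url, now=None):
        """Return the cached verdict, or None if a fresh VM check is needed."""
        now = time.time() if now is None else now
        entry = self._entries.get(url)
        if entry is None:
            return None
        is_safe, expiries = entry
        # Reuse the verdict only if every component is still valid.
        if all(expiry > now for expiry in expiries):
            return is_safe
        return None


cache = SecurityCache()
cache.store("http://site.example/", True, [100.0, 200.0])
print(cache.lookup("http://site.example/", now=50.0))
print(cache.lookup("http://site.example/", now=150.0))
```

A `None` result means the proxy would fall back to a full VM-based evaluation, so a stale component costs latency but never skips a check.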


4.4 Performance on a Realistic Workload 


Previous sections examined the individual impact of 
each of our optimizations. In the end, however, the ques- 
tion remains: how does SpyProxy perform for a “typ- 
ical” user Web-browsing workload? A more realistic 
workload will cause the performance optimizations — 
caching, static analysis, and staged release — to be exer- 
cised together in response to a stream of requests. 

To study the behavior of SpyProxy when confronted 
with a realistic request stream, we measured the re- 
sponse latencies of 1,909 Web page requests issued by 
our broadband Web client. These requests were gener- 
ated with a Zipf popularity distribution drawn from a list 
of 703 different safe URLs from 124 different sites. We 
chose the URLs by selecting a range of popular and un- 
popular sites ranked by the Alexa ranking service. By 
selecting real sites, we exercised our system with the dif- 
ferent varieties and complexities of Web page content to 
which users are typically exposed. By generating our 
workload with a Zipf popularity distribution, we gave 
our caching optimization the opportunity to work in a 
realistic scenario. None of the sites we visited contained 
attacks; our goal was simply to evaluate the performance 
impact of SpyProxy on browsing. 
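
A request stream of this shape can be generated with a few lines of code. This is a sketch only: the text does not give the Zipf exponent or the URL list, so the exponent of 1.0, the seed, and the placeholder URLs are assumptions.

```python
import random


def zipf_workload(urls, num_requests, exponent=1.0, seed=0):
    """Draw requests so the URL of rank r is chosen with weight 1 / r**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, len(urls) + 1)]
    return rng.choices(urls, weights=weights, k=num_requests)


# Placeholder URL list with 703 entries, matching the count in the text.
urls = [f"http://site.example/page{i}" for i in range(703)]
requests = zipf_workload(urls, num_requests=1909)

# Requests beyond the first visit to each URL are candidates for cache hits.
repeats = len(requests) - len(set(requests))
print(len(requests), repeats)
```

Because 1,909 requests are drawn from 703 URLs, repeated requests are unavoidable, which is what gives the caching optimization something to work with.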

Figure 5(a) presents a cumulative distribution func- 
tion for the time to start page rendering in the client 
browser. This is the delay the user sees before the 
browser responds to a request. Figure 5(b) shows the 
CDF for full-page-load latencies. Each figure depicts 
distributions for three cases: (1) directly connecting to 
the Web site without SpyProxy, (2) using the optimized 
SpyProxy implementation, and (3) using SpyProxy with 
optimizations disabled. We flushed all caches before 
gathering the data for each distribution. 
Figure 5: Overall performance (broadband). These graphs
show the distributions of (a) render start latencies and (b) full
page load latencies for a workload consisting of 1,909 requests
issued to 703 pages from 102 Web sites. Each graph compares
the response time for a direct client, the unoptimized system,
and the fully optimized system. The artifact visible at low
latencies on the optimized line in (a) corresponds to hits in our
security cache.


Our results demonstrate that the optimized SpyProxy 
system delivers content to browsers very quickly. The 
median time until rendering began was 0.8 seconds in 
the optimized system compared to 2.4 seconds in the un- 
optimized system. There is still room to improve; the 
median start time for the direct connection was 0.2 sec- 
onds. However, the optimized system feels acceptably 
fast to a user. In contrast, the unoptimized system seems 
noticeably sluggish compared to the optimized system 
and direct connections. 

A typical request flowing through the optimized
system involves several potential sources of overhead,
including interacting with the Squid proxy cache and
pre-executing content in a virtual machine. In spite of this,
the optimized SpyProxy effectively masks latency,
resulting in an interactive, responsive system. In addition,
our system generated very few false positives: only 4 of
the 1,909 Web page requests resulted in an alarm being
raised. Even though the offending pages were benign,
they did in fact attempt to install software on the user’s
computer, albeit by requesting permission from the user
first. For example, one of the pages prompted the user
to install a browser plug-in for the QuickTime media
player. We chose not to deal with such opt-in installers,
as SpyProxy is primarily intended for zero-day attacks
that never ask for permission before installing malware.
However, we do reduce the number of false positives by
including the most common browser plug-ins, such as
Flash, in the base VM image.


4.5 Scalability 


SpyProxy is designed to service many concurrent
users in an organizational setting. Our implementation
runs on a cluster of workstations, achieving incremental
scalability by executing VM workers on additional
nodes. We now provide some back-of-the-envelope
estimations of SpyProxy’s scalability. We have not
performed an explicit scaling benchmark, but our
calculations do provide an approximate indication of how many
CPUs would be necessary to support a user population of
a given size.

Our estimate is based on the assumption that the CPU
is likely to be the bottleneck of a deployed system; for
this to be true, the system must be configured with an
adequate amount of memory and network bandwidth to
support the required concurrent virtual machines and
Web traffic. While performing the evaluation in
Section 4.4, we measured the amount of CPU time required
to process a Web page in SpyProxy. On a 2.8GHz
Pentium 4 machine with 4GB of RAM and a single 80GB
7200 RPM disk, we found the average CPU time
consumed per page was 0.35 seconds.

There is little published data on the number of Web
pages users view per day. In a study of Internet
content-delivery systems [38], users requested 930 HTTP
objects per day on average, and another study found that
an average Web page contains about 15 objects [27].
Combining these gives roughly 62 pages per user per day,
which we conservatively round up to 100 pages per day.
Assuming this browsing activity is uniformly distributed
over an 8-hour workday, and given 0.35 seconds of CPU
time per page, one CPU can process 82,286 Web pages
per workday, implying a single-CPU SpyProxy could support
approximately 822 users. A single quad-core machine
should be able to handle the load from an organization
containing a few thousand people.
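
The back-of-the-envelope estimate can be reproduced directly from the figures quoted in the text. The variable names are illustrative, and the quad-core line simply assumes that capacity scales linearly with the number of cores.

```python
cpu_seconds_per_page = 0.35    # measured average CPU time per page
workday_seconds = 8 * 60 * 60  # 8-hour workday
pages_per_user_per_day = 100   # the text's conservative estimate

# Estimate implied by the two cited studies: 930 objects/day, 15 objects/page.
pages_from_studies = 930 / 15

pages_per_cpu_per_day = workday_seconds / cpu_seconds_per_page
users_per_cpu = pages_per_cpu_per_day / pages_per_user_per_day
users_per_quad_core = 4 * users_per_cpu  # assumes linear scaling across cores

print(pages_from_studies)
print(round(pages_per_cpu_per_day))
print(int(users_per_cpu))
print(int(users_per_quad_core))
```

The result is sensitive to the CPU-bottleneck assumption stated earlier; if memory or network bandwidth saturates first, the supported population would be smaller.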


4.6 Summary 


This section evaluated the effectiveness and perfor- 
mance of our SpyProxy prototype. Our measurements 
demonstrated that SpyProxy effectively detects mali- 
cious content. In our experiments, SpyProxy correctly 
detected and blocked every threat, including several that 
SiteAdvisor failed to identify. Our experiments with 
fully optimized SpyProxy show that a proxy-based spy- 
ware checker can be implemented with only minimal 
performance impact on the user. On average, the use 
of SpyProxy added only 600 milliseconds to the user- 
visible latency before rendering starts. In our experience 
using the system, this small additional overhead does not 
noticeably degrade the system’s responsiveness. 


5 Related Work 


We now discuss related research on spyware detection 
and prevention, intrusion detection and firewall systems, 
and network proxies. 


5.1 Spyware and Malware Detection 


In previous work, we used passive network monitor- 
ing to measure adware propagation on the University of 
Washington campus [37]. In a follow-on study, we used 
Web crawling to find and analyze executable programs 
and Web pages that lead to spyware infections [28]; the 
trigger-based VM analysis technique in that work forms 
the foundation for SpyProxy’s detection mechanism. 

Strider HoneyMonkey [49] and the commercial 
SiteAdvisor service [41] both use a VM-based technique 
similar to ours to characterize malicious Web sites and 
pages. Our work differs in two main ways: we show that 
our VM-based technique can be used to build a transpar- 
ent defense system rather than a measurement tool, and 
we examine optimizations that enable our system to per- 
form efficiently and in real time. 

Our system detects malicious Web content by execut- 
ing it and looking for evidence of malicious side-effects. 
Other systems have attempted to detect malware by ex- 
amining side-effects, including Gatekeeper [51], which 
monitors Windows extensibility hooks for evidence of 
spyware installation. Another recent detector identifies 
spyware by monitoring API calls invoked when sensi- 
tive information is stolen and transmitted [20]. However, 
these systems only look for malware that is already in- 
stalled. In contrast, SpyProxy uses behavioral analysis 
to prevent malware installation. 

Other works have looked at addressing limitations
of signature-based detection. Semantics-aware malware
detection [8] uses an instruction-level analysis of
programs to match their behavior against signature
templates. This technique improves malware detection, but
not prevention. Several projects explore automatic
generation of signatures for detection of unknown malware
variants [7, 29, 40, 46, 48]. These typically need
attack traffic and time to generate signatures, leaving some
clients vulnerable when a new threat first appears.

Some commercial client-side security tools have
begun to incorporate behavioral techniques, and two
recent products, Prevx1 [32] and Primary Response
SafeConnect [35], use purely behavioral detection.
However, these tools must run on systems packed with
client-installed programs, which limits their behavioral
analysis. In contrast, SpyProxy pre-executes content in a clean
sandbox, where it can apply a much stricter set of
behavioral rules.

Other approaches prevent Web-based malware infes- 
tations by protecting the user’s system from the Web 
browser using VM isolation [10], OS-level sandbox- 
ing [16, 36], or logging/rollback [17]. Fundamentally, 
this containment approach is orthogonal to our preven- 
tion approach. Although these tools provide strong isola- 
tion, they have different challenges, such as data sharing 
and client-side performance overhead. 

Remote playgrounds move some of the browser func- 
tionality (namely, execution of untrusted Java applets) 
away from the client desktop and onto dedicated ma- 
chines [26]; the client browser becomes an I/O termi- 
nal to the actual browser running elsewhere. Our archi- 
tecture is different — SpyProxy pre-executes Web pages 
using an unmodified browser and handles any form of 
active code, allowing it to capture a wider range of at- 
tacks. Nevertheless, SpyProxy could benefit from this 
technique in the future, for example by forwarding user 
input to the VM worker in AJAX sites. 

Several projects tackle the detection and prevention 
of other classes of malware, including worms, viruses, 
and rootkits [19, 39, 50]. SpyProxy complements these 
defenses with protection against Web-borne attacks, re- 
sulting in better overall desktop security. 


5.2 Intrusion Detection and Firewalls 


Intrusion detection systems (e.g., Bro [31] and 
snort [42]) protect networks from attack by searching 
through incoming packets for known attack signatures. 
These systems are typically passive, monitoring traffic as 
it flows into a network and alerting a system administra- 
tor when an attack is suspected. More sophisticated in- 
trusion detection systems attempt to identify suspicious 
traffic using anomaly detection [3, 4, 22, 23]. A related 
approach uses protocol-level analysis to look for attacks 
that exploit specific vulnerabilities, such as Shield [47]. 
The same idea has been applied at the HTML level in 
client-side firewalls and proxies [25, 30, 33]. 

These systems typically look for attack signatures for
well-established protocols and services. As a result, they
cannot detect new or otherwise undiscovered attacks.
Since they are traditionally run in a passive manner,
attacks are detected but not prevented. Our system
executes potentially malicious content in a sandboxed
environment, using observed side-effects rather than
signatures to detect attacks and protect clients.

Shadow honeypots combine network intrusion detec- 
tion systems and honeypots [2]. They route risky net- 
work traffic to a heavily instrumented version of a vulner- 
able application, which detects certain types of attacks at 
run-time. In contrast, SpyProxy does not need to instru- 
ment the Web browser that it guards, and its run-time 
checks are more general and easier to define. 


5.3 Proxies 


Proxies have been used to introduce new services be- 
tween Web clients and servers. For example, they have 
been used to provide scalable distillation services for mo- 
bile clients [14], Web caching [1, 13, 15, 21, 52, 53], and 
gateway services for onion-routing anonymizers [11]. 
SpyProxy builds on these advantages, combining active 
content checking with standard proxy caching. Spy- 
Bye [30] is a Web proxy that uses a combination of 
blacklisting, whitelisting, ClamAV-based virus scanning, 
and heuristics to identify potentially malicious Web con- 
tent. In contrast, SpyProxy uses execution-based analy- 
sis to identify malicious content. 


6 Conclusions 


This paper described the design, implementation, and 
evaluation of SpyProxy, an execution-based malware de- 
tection system that protects clients from malicious Web 
pages, such as drive-by-download attacks. SpyProxy ex- 
ecutes active Web content in a safe virtual machine be- 
fore it reaches the browser. Because SpyProxy relies on 
the behavior of active content, it can block zero-day at- 
tacks and previously unseen threats. For performance, 
SpyProxy benefits from a set of optimizations, including 
the staged release of content and caching the results of 
security checks. 

Our evaluation of SpyProxy demonstrates that it 
meets its goals of safety, responsiveness, and trans- 
parency: 


1. SpyProxy successfully detected and blocked all of 
the threats it faced, including threats not identified 
by other detectors. 


2. The SpyProxy prototype adds only 600 millisec- 
onds of latency to the start of page rendering—an 
amount that is negligible in the context of browsing 
over a broadband connection. 


3. Our prototype integrates easily into the network and 
its existence is transparent to users. 

Execution-based analysis does have limitations. We
described several of these, including issues related to
non-determinism, termination, and differences in the
execution environment between the client and the proxy.

There are many existing malware detection tools, and 
although none of them are perfect, together they con- 
tribute to a “defense in depth” security strategy. Our goal 
is neither to build a perfect tool nor to replace existing 
tools, but to add a new weapon to the Internet security 
arsenal. Overall, our prototype and experiments demon- 
strate the feasibility and value of on-the-fly, execution- 
based defenses against malicious Web-page content. 
Abstract


Voice over IP (VoIP) has become a popular protocol for 
making phone calls over the Internet. Due to the poten- 
tial transit of sensitive conversations over untrusted net- 
work infrastructure, it is well understood that the con- 
tents of a VoIP session should be encrypted. However, 
we demonstrate that current cryptographic techniques do 
not provide adequate protection when the underlying au- 
dio is encoded using bandwidth-saving Variable Bit Rate 
(VBR) coders. Explicitly, we use the length of encrypted 
VoIP packets to tackle the challenging task of identifying 
the language of the conversation. Our empirical analysis 
of 2,066 native speakers of 21 different languages shows 
that a substantial amount of information can be discerned 
from encrypted VoIP traffic. For instance, our 21-way 
classifier achieves 66% accuracy, almost a 14-fold im- 
provement over random guessing. For 14 of the 21 lan- 
guages, the accuracy is greater than 90%. We achieve 
an overall binary classification (e.g., “Is this a Spanish 
or English conversation?”) rate of 86.6%. Our analysis 
highlights what we believe to be interesting new privacy 
issues in VoIP. 


1 Introduction 


Over the last several years, Voice over IP (VoIP) has
enjoyed a marked increase in popularity, particularly as
a replacement for traditional telephony for international
calls. At the same time, the security and privacy
implications of conducting everyday voice communications
over the Internet are not yet well understood. For the
most part, the current focus on VoIP security has centered
around efficient techniques for ensuring confidentiality
of VoIP conversations [3, 6, 14, 37]. Because of
the success of these efforts and the attention they have
received, it is now widely accepted that VoIP traffic should
be encrypted before transmission over the Internet.
Nevertheless, little, if any, work has explored the threat of
traffic analysis of encrypted VoIP calls. In this paper,
we show that although encryption prevents an eavesdropper
from reading packet contents and thereby listening in
on VoIP conversations (for example, using [21]), traffic
analysis can still be used to infer more information than
expected, namely the spoken language of the conversation.
Identifying the spoken language in VoIP communications
has several obvious applications, many of which
have substantial privacy ramifications [7].


The type of traffic analysis we demonstrate in this pa- 
per is made possible because current recommendations 
for encrypting VoIP traffic (generally, the application of 
length-preserving stream ciphers) do not conceal the size 
of the plaintext messages. While leaking message size 
may not pose a significant risk for more traditional forms 
of electronic communication such as email, properties of 
real-time streaming media like VoIP greatly increase the 
potential for an attacker to extract meaningful informa- 
tion from plaintext length. For instance, the size of an en- 
coded audio frame may have much more meaningful se- 
mantics than the size of a text document. Consequently, 
while the size of an email message likely carries little in- 
formation about its contents, the use of bandwidth-saving 
techniques such as variable bit rate (VBR) coding means 
that the size of a VoIP packet is directly determined by 
the type of sound its payload encodes. This informa- 
tion leakage is exacerbated in VoIP by the sheer num- 
ber of packets that are sent, often on the order of tens or 
hundreds every second. Access to such large volumes of 
packets over a short period of time allows an adversary to 
quickly estimate meaningful distributions over the packet 
lengths, and in turn, to learn information about the lan- 
guage being spoken. 

Identifying spoken languages is a task that, on the surface,
may seem simple. However, it is a problem that
has not only received substantial attention in the speech 
and natural language processing community, but has also 
been found to be challenging even with access to full 
acoustic data. Our results show an encrypted conversa- 
Figure 1: Uncompressed audio signal, Speex bit rates, and packet sizes for a random sample from the corpus. 


tion over VoIP can leak information about its contents, 
to the extent that an eavesdropper can successfully in- 
fer what language is being spoken. The fact that VoIP 
packet lengths can be used to perform any sort of lan- 
guage identification is interesting in and of itself. Our 
success with language identification in this setting pro- 
vides strong grounding for mandating the use of fixed 
length compression techniques in VoIP, or for requiring 
the underlying cryptographic engine to pad each packet 
to a common length. 

The rest of this paper is organized as follows. We be- 
gin in Section 2 by reviewing why and how voice over IP 
technologies leak information about the language spoken 
in an encrypted call. In Section 3, we describe our design 
for a classifier that exploits this information leakage to 
automatically identify languages based on packet sizes. 
We evaluate this classifier’s effectiveness in Section 4, 
using open source VoIP software and audio samples from 
a standard data set used in the speech processing com- 
munity. We review related work on VoIP security and 
information leakage attacks in Section 5, and conclude 
in Section 6. 


2 Information Leakage via Variable Bit 
Rate Encoding 


To highlight why language identification is possible in 
encrypted VoIP streams, we find it instructive to first review
the relevant inner workings of a modern VoIP system.
Most VoIP calls use at least two protocols: (1)
a signaling protocol such as the Session Initiation Pro- 
tocol (SIP) [23] used for locating the callee and estab- 
lishing the call and (2) the Real Time Transport Proto- 
col (RTP) [25, 4] which transmits the audio data, en- 
coded using a special-purpose speech codec, over UDP. 
While several speech codecs are available (including 
G.711 [10], G.729 [12], Speex [29], and iLBC [2]), 
we choose the Speex codec for our investigation as 
it offers several advanced features like a VBR mode 
and discontinuous transmission, and its source code is 
freely available. Additionally, although Speex is not 
the only codec to offer variable bit rate encoding for 
speech [30, 16, 20, 35, 5], it is the most popular of those 
that do. 


Speex, like most other modern speech codecs, is based 
on code-excited linear prediction (CELP) [24]. In CELP, 
the encoder uses vector quantization with both a fixed 
codebook and an adaptive codebook [22] to encode a 
window of n audio samples as one frame. For example, 
in the Speex default narrowband mode, the audio input is 
sampled at 8kHz, and the frames each encode 160 sam- 
ples from the source waveform. Hence, a packet contain- 
ing one Speex frame is typically transmitted every 20ms. 
In VBR mode, the encoder takes advantage of the fact 
that some sounds are easier to represent than others. For 
example, with Speex, vowels and high-energy transients 
Figure 2: Unigram frequencies of bit rates for English, Brazilian Portuguese, German, and Hungarian.
Figure 3: Unigram frequencies of bit rates for Indonesian, Czech, Russian, and Mandarin.
require higher bit rates than fricative sounds like “s” or 
“f” [28]. To achieve improved sound quality and a low 
(average) bit rate, the encoder uses fewer bits to encode 
frames which contain “easy” sounds and more bits for 
frames with sounds that are harder to encode. Because 
the VBR encoder selects the best bit rate for each frame, 
the size of a packet can be used as a predictor of the bit 
rate used to encode the corresponding frame. Therefore, 
given only packet lengths, it is possible to extract infor- 
mation about the underlying speech. Figure 1, for exam- 
ple, shows an audio input, the encoder’s bit rate, and the 
resulting packet sizes as the information is sent on the 
wire; notice how strikingly similar the last two cases are. 
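
To make the link between bit rate and packet length concrete, the following sketch computes the on-the-wire packet size for a few of the narrowband bit rates mentioned in this paper. It assumes one 20 ms frame (160 samples at 8kHz) per packet and 40 bytes of unencrypted IPv4, UDP, and RTP headers with no options or extensions. The function and constant names are illustrative and are not part of Speex.

```python
SAMPLE_RATE_HZ = 8000       # Speex narrowband sampling rate
SAMPLES_PER_FRAME = 160     # one frame covers 20 ms of audio
HEADER_BYTES = 20 + 8 + 12  # assumed IPv4 + UDP + RTP headers, no options


def payload_bytes(bit_rate_bps: int) -> int:
    """Bytes needed to carry one frame encoded at the given bit rate."""
    bits = bit_rate_bps * SAMPLES_PER_FRAME // SAMPLE_RATE_HZ
    return (bits + 7) // 8  # round up to a whole byte


def packet_bytes(bit_rate_bps: int) -> int:
    """Observed length of a packet that carries a single frame."""
    return HEADER_BYTES + payload_bytes(bit_rate_bps)


# Bit rates (in bits per second) taken from the text: 0.25, 15.0, 18.2, 24.6 kbps.
for rate in (250, 15000, 18200, 24600):
    print(rate, payload_bytes(rate), packet_bytes(rate))
```

Under these assumptions, each bit rate maps to exactly one packet length, which is why an observer who sees only lengths can recover the encoder's bit rate choices.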

As discussed earlier, by now it is commonly accepted 
that VoIP traffic should not be transmitted over the Inter- 
net without some additional security layer [14, 21]. In- 
deed, a number of proposals for securing VoIP have al- 
ready been introduced. One such proposal calls for tun- 
neling VoIP over IPSec, but doing so imposes unaccept- 
able delays on a real-time protocol [3]. An alternative, 
endorsed by NIST [14], is the Secure Real Time Trans- 
port Protocol (SRTP) [4]. SRTP is an extension to RTP 
and provides confidentiality, authenticity, and integrity 
for real-time applications. SRTP allows for three modes 
of encryption: AES in counter mode, AES in f8-mode, 
and no encryption. For the two stream ciphers, the stan- 
dard states that “in case the payload size is not an inte- 
ger multiple of (the block length), the excess bits of the 
key stream are simply discarded” [4]. Moreover, while 
the standard permits higher level protocols to pad their 
messages, the default in SRTP is to use length-preserving 
encryption and so one can still derive information about 
the underlying speech by observing the lengths of the en- 
crypted payloads. 

Given that the sizes of encrypted payloads are closely 
related to bit rates used by the speech encoder, a pertinent
question is whether different languages are encoded 
at different bit rates. Our conjecture is that this is in- 
deed the case, and to test this hypothesis we examine 
real speech data from the Oregon Graduate Institute Cen- 
ter for Speech Learning and Understanding’s “22 Lan- 
guage” telephone speech corpus [15]. The data set con- 
sists of speech from native speakers of 21 languages, 
recorded over a standard telephone line at 8kHz. This 
is the same sampling rate used by the Speex narrowband 
mode. General statistics about the data set are provided 
in Appendix A. 

As a preliminary test of our hypothesis, we encoded 
all of the audio files from the CSLU corpus and recorded 
the sequence of bit rates used by Speex for each file. In 
narrowband VBR mode with discontinuous transmission 
enabled, Speex encodes the data set using nine distinct 
bit rates, ranging from 0.25kbps up to 24.6kbps. Fig- 
ure 2 shows the frequency for each bit rate for English, 
Brazilian Portuguese, German, and Hungarian. For most 
bit rates, the frequencies for English are quite close to 
those for Portuguese, but Portuguese and Hungarian appear
to exhibit different distributions. These results suggest
that distinguishing Portuguese from Hungarian, for 
example, would be less challenging than differentiating 
Portuguese from English, or Indonesian from Russian 
(see Figure 3). 

Figures 4 and 5 provide additional evidence that bi- 
gram frequencies (i.e., the number of instances of con- 
secutively observed bit rate pairs) differ between lan- 
guages. The x and y axes of both figures specify ob- 
served bit rates. The density of the square (x,y) shows 
the difference in probability of bigram x, y between the 
two languages divided by the average probability of bi- 
gram x,y between the two. Thus, dark squares indicate 
Figure 4: The normalized difference in bigram frequencies 
between Brazilian Portuguese (BP) and English (EN). 


significant differences between the languages for an ob- 
served bigram. Notice that while Brazilian Portuguese 
(BP) and English (EN) are similar, there are differences 
between their distributions (see Figure 4). Languages 
such as Mandarin (MA) and Tamil (TA) (see Figure 5), 
exhibit more substantial incongruities. 
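
The quantity plotted in Figures 4 and 5 can be sketched as follows. This is a minimal illustration, assuming that each language is represented by a sequence of per-frame bit rates and that the plotted density is the absolute value of the normalized difference; the function names and the toy sequences are hypothetical.

```python
from collections import Counter


def bigram_probs(bit_rates):
    """Estimate bigram probabilities from a sequence of per-frame bit rates."""
    pairs = list(zip(bit_rates, bit_rates[1:]))
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}


def normalized_difference(p_a, p_b):
    """Per bigram: |P_a - P_b| divided by the mean of P_a and P_b."""
    result = {}
    for pair in set(p_a) | set(p_b):
        a, b = p_a.get(pair, 0.0), p_b.get(pair, 0.0)
        result[pair] = abs(a - b) / ((a + b) / 2)
    return result


# Toy bit-rate sequences (kbps) standing in for two languages.
lang_a = [8.0, 11.0, 8.0, 11.0, 15.0]
lang_b = [8.0, 11.0, 15.0, 11.0, 15.0]
diff = normalized_difference(bigram_probs(lang_a), bigram_probs(lang_b))
for pair in sorted(diff):
    print(pair, round(diff[pair], 3))
```

A bigram that occurs in only one of the two languages receives the maximum value of 2, while a bigram that is equally likely in both receives 0.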

Encouraged by these results, we applied the $\chi^2$ test to 
examine the similarity between sample unigram distributions.
The $\chi^2$ test is a non-parametric test that provides 
an indication as to the likelihood that samples are drawn 
from the same distribution. The $\chi^2$ results confirmed 
(with high confidence) that samples from the same lan- 
guage have similar distributions, while those from dif- 
ferent languages do not. In the next section, we explore 
techniques for exploiting these differences to automati- 
cally identify the language spoken in short clips of en- 
crypted VoIP streams. 


3 Classifier 


We explored several classifiers (e.g., using techniques 
based on k-Nearest Neighbors, Hidden Markov Models, 
and Gaussian Mixture Models), and found that a variant 
of a $\chi^2$ classifier provided a similar level of accuracy, 
but was more computationally efficient. In short, the 
$\chi^2$ classifier takes a set of samples from a speaker and 
models (or probability distributions) for each language, 
and classifies a speaker as belonging to the language for 
which the $\chi^2$ distance between the speaker’s model and 
the language’s model is minimized. To construct a lan- 
guage model, each speech sample (i.e., a phone call), is 
represented as a series of packet lengths generated by a 
Speex-enabled VoIP program. We simply count the n- 
grams of packet lengths in each sample to estimate the 
multinomial distribution for that model (for our empiri- 
Figure 5: The normalized difference in bigram frequencies 
between Mandarin (MA) and Tamil (TA). 


cal analysis, we set n = 3). For example, if given a stream 
of packets with lengths of 55, 86, 60, 50, and 46 bytes, 
we would extract the 3-grams (55, 86, 60), (86, 60, 50), 
(60, 50, 46), and use those triples to estimate the distributions.¹
We do not distinguish whether a packet represents
speech or silence (as it is difficult to do so with high 
accuracy), and simply count each n-gram in the stream. 
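
The n-gram counting step can be sketched in a few lines. This is an illustrative fragment, not the code used in our experiments; the function names are hypothetical, and the input is simply the list of observed packet lengths.

```python
from collections import Counter


def ngrams(lengths, n=3):
    """All overlapping n-grams of a packet-length sequence, in order."""
    return [tuple(lengths[i:i + n]) for i in range(len(lengths) - n + 1)]


def ngram_distribution(lengths, n=3):
    """Relative frequency of each n-gram observed in the stream."""
    grams = ngrams(lengths, n)
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}


stream = [55, 86, 60, 50, 46]
print(ngrams(stream))
print(ngram_distribution(stream))
```

For the five-packet stream above, this produces the three trigrams listed in the text, each with estimated probability 1/3.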


It is certainly the case that some n-grams will be more 
useful than others for the purposes of language separa- 
tion. To address this, we modify the above construc- 
tion such that our models only incorporate n-grams that 
exhibit low intraclass variance (i.e., the speakers within 
the same language exhibit similar distributions on the 
n-gram of concern) and high interclass variance (i.e., 
the speakers of one language have different distributions 
than those of other languages for that particular n-gram). 
Before explaining how to determine the distinguishability
of an n-gram g, we first introduce some notation. Assume
we are given a set of languages, $\mathcal{L}$. Let $P_L(g)$ denote
the probability of the n-gram g given the language 
$L \in \mathcal{L}$, and $P_s(g)$ denote the probability of the n-gram g 
given the speaker $s \in L$. All probabilities are estimated 
by dividing the total number of occurrences of a given 
n-gram by the total number of observed n-grams. 


For the n-gram g we compute its average intraclass 
variability as: 


$$\mathrm{VAR}_{intra}(g) = \frac{1}{|\mathcal{L}|} \sum_{L \in \mathcal{L}} \frac{1}{|L|} \sum_{s \in L} \left(P_s(g) - P_L(g)\right)^2$$


Intuitively, this measures the average distance between 
the probability of g for a given speaker and the probability
of g given that speaker’s language; i.e., the average 
variance of the probability distributions $P_s(g)$. We compute
the interclass variability as: 


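$$\mathrm{VAR}_{inter}(g) = \left(\frac{1}{(|\mathcal{L}| - 1) \sum_{L \in \mathcal{L}} |L|}\right) \sum_{L_1 \in \mathcal{L}} \sum_{s \in L_1} \sum_{L_2 \in \mathcal{L} \setminus L_1} \left(P_s(g) - P_{L_2}(g)\right)^2$$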


This measures, on average, the difference between the 
probability of g for a given speaker and the probability 
of g given every other language. The second two sum- 
mations in the second term measure the distance from 
each speaker in a specific language to the means of all 
other languages. The first summation and the leading 
normalization term are used to compute the average over 
all languages. As an example, if we consider the seventh 
and eighth bins in the unigram case illustrated in Fig- 
ure 2, then VARinter(15.0 kbps) < VARinter(18.2 kbps). 

We set the overall distinguishability for n-gram g to 
be DIS(g) = VARinter(g)/VARintra(g). Intuitively, if 
DIS(g) is large, then speakers of the same language tend 
to have similar probability densities for g, and these densities
will vary across languages. We choose to make our 
classification decisions using only those g with DIS(g) > 1;
we denote this set of distinguishing n-grams as G. The 
model for language L is simply the probability distribution
$P_L$ over G. 
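
The feature-selection step can be sketched for a single n-gram g as follows. This is a minimal illustration under stated assumptions: the per-speaker probabilities $P_s(g)$ and per-language probabilities $P_L(g)$ are given as plain dictionaries, the interclass term is averaged over every (speaker, other language) pair, and all names and toy values are hypothetical.

```python
def var_intra(p_speaker, p_lang):
    """Average over languages of the mean squared deviation of each
    speaker's P_s(g) from the P_L(g) of the speaker's own language."""
    total = 0.0
    for lang, probs in p_speaker.items():
        total += sum((p - p_lang[lang]) ** 2 for p in probs) / len(probs)
    return total / len(p_speaker)


def var_inter(p_speaker, p_lang):
    """Mean squared deviation of each speaker's P_s(g) from the P_L(g)
    of every language other than the speaker's own."""
    n_speakers = sum(len(probs) for probs in p_speaker.values())
    total = 0.0
    for lang, probs in p_speaker.items():
        for p in probs:
            total += sum((p - p_lang[other]) ** 2
                         for other in p_lang if other != lang)
    return total / ((len(p_speaker) - 1) * n_speakers)


def distinguishability(p_speaker, p_lang):
    """DIS(g) = VAR_inter(g) / VAR_intra(g); g is kept when this exceeds 1."""
    return var_inter(p_speaker, p_lang) / var_intra(p_speaker, p_lang)


# Toy values of P_s(g) for two speakers in each of two languages.
p_speaker = {"A": [0.10, 0.12], "B": [0.30, 0.28]}
p_lang = {"A": 0.11, "B": 0.29}
print(distinguishability(p_speaker, p_lang))
```

In this toy case the speakers of each language agree closely with one another and differ sharply from the other language, so DIS(g) is far above 1 and g would be retained in G. A production version would also need to handle n-grams whose intraclass variance is zero.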

To further refine the models, we remove outliers 
(speakers) who might contribute noise to each distribu- 
tion. In order to do this, we must first specify a distance 
metric between a speaker s and a language L. Suppose 
that we extract N total n-grams from s’s speech samples. 
Then, we compute the distance between s and L as: 

$$\Delta(P_s, P_L, G) = \sum_{g \in G} \frac{\left(N \cdot P_L(g) - N \cdot P_s(g)\right)^2}{N \cdot P_L(g)}$$

We then remove the speakers s from L for which 
$\Delta(P_s, P_L, G)$ is greater than some language-specific 
threshold $t_L$. After we have removed these outliers, we 
recompute $P_L$ with the remaining speakers. 

Given our refined models, our goal is to use a speaker’s 
samples to identify the speaker’s language. We assign the 
speaker s to the language with the model that is closest 
to the speaker’s distribution over G as follows: 

$$L^* = \arg\min_{L \in \mathcal{L}} \Delta(P_s, P_L, G)$$
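
The distance computation and the final classification rule can be sketched as follows. This is an illustrative fragment: the dictionaries, language labels, and n-grams are hypothetical, and skipping n-grams to which a language model assigns zero probability is an assumption made here to avoid division by zero.

```python
def chi2_distance(p_s, p_l, grams, n_total):
    """Sum over g in G of (N*P_L(g) - N*P_s(g))^2 / (N*P_L(g))."""
    dist = 0.0
    for g in grams:
        expected = n_total * p_l.get(g, 0.0)
        observed = n_total * p_s.get(g, 0.0)
        if expected > 0:  # assumption: ignore n-grams the model never saw
            dist += (expected - observed) ** 2 / expected
    return dist


def classify(p_s, models, grams, n_total):
    """Return the language whose model is closest to the speaker."""
    return min(models,
               key=lambda lang: chi2_distance(p_s, models[lang], grams, n_total))


# Toy example with two distinguishing trigrams and two language models.
G = [(41, 41, 46), (55, 86, 60)]
models = {
    "EN": {G[0]: 0.5, G[1]: 0.5},
    "HU": {G[0]: 0.9, G[1]: 0.1},
}
speaker = {G[0]: 0.8, G[1]: 0.2}
print(classify(speaker, models, G, n_total=100))
```

Here the speaker's distribution lies much closer to the second model (distance of roughly 11 versus 36), so that language is selected.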


To determine the accuracy of our classifier, we apply 
the standard leave-one-out cross validation analysis to 
each speaker in our data set. That is, for a given speaker, 
we remove that speaker’s samples and use the remaining 
samples to compute G and the models $P_L$ for each language
$L \in \mathcal{L}$. We choose the $t_L$ such that 15% of the 
speakers are removed as outliers (these outliers are eliminated
during model creation, but they are still included 
in classification results). Next, we compute the probability
distribution, $P_s$, over G using the speaker’s samples. 
Finally, we classify the speaker using $P_s$ and the
outlier-reduced models derived from the other speakers in the 
corpus. 
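
The evaluation procedure can be sketched generically. The fragment below is a simplified illustration rather than our experimental code: it shows one way to drop the 15% of speakers farthest from a model and a leave-one-out loop with pluggable training and classification functions. The toy stand-ins at the end (one scalar feature per speaker and a nearest-mean rule) replace the $\chi^2$ models purely to keep the example short, and all names are hypothetical.

```python
def remove_outliers(distances, fraction=0.15):
    """Return the speakers kept after dropping the given fraction with the
    largest distance to their language model."""
    n_drop = round(len(distances) * fraction)
    ranked = sorted(distances, key=distances.get)  # closest first
    return ranked[:len(ranked) - n_drop]


def leave_one_out(samples, train, classify):
    """Fraction of speakers classified correctly when each is held out.

    samples:  list of (features, language) pairs, one per speaker.
    train:    builds a model from a list of such pairs.
    classify: maps (model, features) to a predicted language.
    """
    correct = 0
    for i, (features, language) in enumerate(samples):
        training_set = samples[:i] + samples[i + 1:]
        model = train(training_set)
        if classify(model, features) == language:
            correct += 1
    return correct / len(samples)


# Toy stand-ins: per-language mean of a scalar feature, nearest mean wins.
def train_means(pairs):
    by_lang = {}
    for value, lang in pairs:
        by_lang.setdefault(lang, []).append(value)
    return {lang: sum(v) / len(v) for lang, v in by_lang.items()}


def nearest_mean(model, value):
    return min(model, key=lambda lang: abs(model[lang] - value))


speakers = [(0.10, "A"), (0.12, "A"), (0.11, "A"),
            (0.30, "B"), (0.28, "B"), (0.29, "B")]
accuracy = leave_one_out(speakers, train_means, nearest_mean)
print(accuracy)
```

Because the held-out speaker never contributes to the model used to classify it, the reported accuracy is not inflated by training on the test data.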


4 Empirical Evaluation 


To evaluate the performance of our classifier in a realis- 
tic environment, we simulated VoIP calls for many dif- 
ferent languages by playing audio files from the Oregon 
Graduate Institute Center for Speech Learning & Under- 
standing’s “22 Language” telephone speech corpus [15] 
over a VoIP connection. This corpus is widely used in 
language identification studies in the speech recognition 
community (e.g. [19], [33]). It contains recordings from 
a total of 2,066 native speakers of 21 languages², with 
over 3 minutes of audio per speaker. The data was orig- 
inally collected by having users call in to an automated 
telephone system that prompted them to speak about sev- 
eral topics and recorded their responses. There are sev- 
eral files for each user. In some, the user was asked to 
answer a question such as “Describe your most recent 
meal” or “What is your address?” In others, they were 
prompted to speak freely for up to one minute. This type 
of free-form speech is especially appealing for our eval- 
uation because it more accurately represents the type of 
speech that would occur in a real telephone conversation. 
In other files, the user was prompted to speak in English 
or was asked about the language(s) they speak. To avoid 
any bias in our results, we omit these files from our anal- 
ysis, leaving over 2 minutes of audio for each user. See 
Appendix A for specifics concerning the dataset. 

Our experimental setup includes two PCs running 
Linux with open source VoIP software [17]. One of the 
machines acts as a server and listens on the network for 
SIP calls. Upon receiving a call, it automatically answers 
and negotiates the setup of the voice channel using Speex 
over RTP. When the voice channel is established, the 
server plays a file from the corpus over the connection 
to the caller, and then terminates the connection. The 
caller, which is another machine on our LAN, automati- 
cally dials the SIP address of the server and then “listens” 
to the file the server plays, while recording the sequence 
of packets sent from the server. The experimental setup 
is depicted in Figure 6. 

Although our current evaluation is based on data col- 
lected on a local area network, we believe that languages 
could be identified under most or all network conditions 
where VoIP is practical. First, RTP (and SRTP) sends in 
Figure 6: Experimental setup. 


the clear a timestamp corresponding to the sampling time 
of the first byte in the packet data [25]. This timestamp 
can therefore be used to infer packet ordering and iden- 
tify packet loss. Second, VoIP is known to degrade sig- 
nificantly under undesirable network connections with 
latency more than a few hundred milliseconds [11], and 
it is also sensitive to packet loss [13]. Therefore any net- 
work which allows for acceptable call quality should also 
give our classifier a sufficient number of trigrams to make 
an accurate classification. 

For a concrete test of our techniques on wide-area 
network data, we performed a smaller version of the 
above experiment by playing a reduced set of 6 lan- 
guages across the Internet between a server on our LAN 
and a client machine on a residential DSL connection. 
In the WAN traces, we observed less than 1% packet 
loss, and there was no statistically significant difference 
in recognition rates for the LAN and WAN experiments. 


4.1 Classifier Accuracy 


In what follows, we examine the classifier’s performance 
when trained using all available samples (excluding, of 
course, the target user’s samples). To do so, we test each 
speaker against all 21 models. The results are presented 
in Figures 7 and 8. Figure 7 shows the confusion matrix 
resulting from the tests. The x axis specifies the language 
of the speaker, and the y axis specifies the language of 
the model. The density of the square at position (x, y) 
indicates how often samples from speakers of language 
x were classified as belonging to language y. 

To grasp the significance of our results, it is impor- 
tant to note that if packet lengths leaked no information, 
then the classification rates for each language would be 
close to random, or about 4.8%. However, the confusion 
matrix shows a general density along the y = x line. 
The classifier performed best on Indonesian (IN), which 
is accurately classified 40% of the time (an eightfold 
improvement over random guessing). It also performed 
well on Russian (RU), Tamil (TA), Hindi (HI), and Ko- 
rean (KO), classifying at rates of 35, 35, 29 and 25 per- 
cent, respectively. Of course, Figure 7 also shows that in 
several instances, misclassification occurs. For instance, 
as noted in Figure 2, English (EN) and Brazilian Por- 
tuguese (BP) exhibit similar unigram distributions, and 
indeed when misclassified, English was often confused 
with Brazilian Portuguese (14% of the time). Nonethe- 
less, we believe these results are noteworthy, as if VoIP 
did not leak information, the classification rates would 
be close to those of random guessing. Clearly, this is not 
the case, and our overall accuracy was 16.3%—that is, a 
three and a half fold improvement over random guessing. 

An alternative perspective is given in Figure 8, which 
shows how often the speaker’s language was among the 
classifier’s top x choices. We plot random guessing as a 
baseline, along with languages that exhibited the highest 
and lowest classification rates. On average, the correct 
language was among our top four speculations 50.2% of 
the time. Note the significant improvement over random 
guessing, which would only place the correct language 
in the top four choices approximately 19% of the time. 
Indonesian is correctly classified in our top three choices 
57% of the time, and even Arabic—the language with the 
lowest overall classification rates—was correctly placed 
among our top three choices 30% of the time. 
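
The random-guessing baselines quoted above follow directly from the number of languages, as the short sketch below shows. The `top_k_accuracy` helper illustrates how a curve such as the one in Figure 8 could be computed from ranked classifier output; its name, its inputs, and the toy rankings are hypothetical.

```python
from fractions import Fraction

NUM_LANGUAGES = 21


def random_top_k(k, n=NUM_LANGUAGES):
    """Chance that the correct language is among k distinct random guesses."""
    return Fraction(k, n)


def top_k_accuracy(rankings, truths, k):
    """Fraction of speakers whose true language is in the top k choices."""
    hits = sum(truth in ranking[:k] for ranking, truth in zip(rankings, truths))
    return hits / len(truths)


print(float(random_top_k(1)))  # first-guess baseline, about 4.8%
print(float(random_top_k(4)))  # top-four baseline, about 19%

# Toy ranked output for two speakers.
rankings = [["IN", "RU", "TA"], ["EN", "BP", "HU"]]
truths = ["RU", "HU"]
print(top_k_accuracy(rankings, truths, k=2))
```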

In many cases, it might be worthwhile to distinguish 
between only two languages, e.g., whether an encrypted 
conversation is in English or Spanish. We performed tests 
that aimed at identifying the correct language when sup- 
plied only two possible choices. We see a stark improve- 
ment over random guessing, with seventy-five percent of 
the language combinations correctly distinguished with 
an accuracy greater than 70.1%; twenty-five percent had 
accuracies greater than 80%. Our overall binary classifi- 
Figure 7: Confusion Matrix for 21-way test using trigrams.
Darkest and lightest boxes represent accuracies 
of 0.4 and 0.0, respectively. 


cation rate was 75.1%. 

Our initial intuitions in Section 2 are strongly correlated 
with our empirical results. For example, rates for Russian 
versus Italian and Mandarin versus Tamil (see Figure 5) 
were 78.5% and 84%, respectively. The differences in 
the histograms shown earlier (Figure 2) also have direct 
implications for our classification rates in this case. For 
instance, our classifier’s accuracy when tasked with dis- 
tinguishing between Brazilian Portuguese and English 
was only 66.5%, whereas the accuracy for English versus 
Hungarian was 86%. 


4.2 Reducing Dimensionality to Improve 
Performance 


Although these results adequately demonstrate that 
length-preserving encryption leaks information in VoIP, 
there are limiting factors to the aforementioned approach 
that hinder classification accuracy. The primary diffi- 
culty arises from the fact that the classifier represents 
each speaker and language as a probability distribution 
over a very high dimensional space. Given 9 different 
observed packet lengths, there are 729 possible different 
trigrams. Of these possibilities, there are 451 trigrams 
that are useful for classification, i.e., DIS(g) > 1 (see 
Section 3). Thus, speaker and language models are prob- 
ability distributions over a 451-dimensional space. Un- 
fortunately, given our current data set of approximately 
7,277 trigrams per speaker, it is difficult to estimate den- 
sities over such a large space with high precision. 

One way to address this problem is based on the ob- 
servation that some bit rates are used in similar ways 
by the Speex encoder. For example, the two lowest bit 
rates, which result in packets of 41 and 46 bytes, re- 
spectively, are often used to encode periods of silence 
Figure 8: CDF showing how often the speaker’s language
was among the classifier’s top x choices. 


or non-speech. Therefore, we can reasonably consider 
the two smallest packet sizes functionally equivalent and 
put them together into a single group. In the same way, 
other packet sizes may be used similarly enough to war- 
rant grouping them together as well. We experimented 
with several mappings of packet sizes to groups, but 
found that the strongest results are obtained by mapping 
the two smallest packet lengths together, mapping all of 
the mid-range packet lengths together, and leaving the 
largest packet size in a group by itself. 


We assign each group a specific symbol, s, and then 
compute n-grams from these symbols instead of the orig- 
inal packet sizes. So, for example, given the sequence of 
packet lengths 41, 50, 46, and 55, we map 41 and 46 to 
$s_1$ and 50 and 55 to $s_2$ to extract the 3-grams $(s_1, s_2, s_1)$ 
and $(s_2, s_1, s_2)$, etc. Our classification process then continues
as before, except that the reduction in the number
of symbols allows us to expand our analysis to 4-grams.
After removing the 4-grams g with DIS(g) < 1, 
we are left with 47 different 4-gram combinations. Thus, 
we reduced the dimensionality of the points from 451 
to 47. Here we are estimating distributions over a 47-dimensional
space using an average of 7,258 4-grams per 
speaker. 
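
The grouping of packet sizes into symbols can be sketched as follows. This is a minimal illustration: the list of nine packet sizes is an assumed set chosen to be consistent with the lengths quoted in the text (41, 46, 50, 55, 60, and 86 bytes) and is not taken from our traces, and the function name is hypothetical.

```python
def make_symbol_map(sizes):
    """Map the two smallest sizes to s1, the mid-range sizes to s2,
    and the largest size to s3."""
    ordered = sorted(set(sizes))
    mapping = {}
    for size in ordered[:2]:
        mapping[size] = "s1"
    for size in ordered[2:-1]:
        mapping[size] = "s2"
    mapping[ordered[-1]] = "s3"
    return mapping


# Assumed set of nine distinct packet sizes, in bytes.
SIZES = [41, 46, 50, 55, 60, 68, 78, 86, 102]
symbol_map = make_symbol_map(SIZES)

stream = [41, 50, 46, 55]
symbols = [symbol_map[length] for length in stream]
trigrams = list(zip(symbols, symbols[1:], symbols[2:]))
print(trigrams)

# Size of the n-gram space before and after grouping.
print(len(SIZES) ** 3)                     # trigrams over nine sizes
print(len(set(symbol_map.values())) ** 4)  # 4-grams over three symbols
```

With three symbols there are only 81 possible 4-grams, compared with 729 possible trigrams over the nine original sizes, which is what makes the longer n-grams practical to estimate.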


Results for this classifier are shown in Figures 9 and 
10. With these improvements, the 21-way classifier cor- 
rectly identifies the language spoken 66% of the time— 
a fourfold improvement over our original classifier and 
more than 13 times better than random guessing. It rec- 
ognizes 14 of the 21 languages exceptionally well, iden- 
tifying them with over 90% accuracy. At the same time, 
there is a small group of languages which the new clas- 
sifier is not able to identify reliably; Czech, Spanish, and 
Vietnamese are never identified correctly on the first try. 
This occurs mainly because the languages which are not 
Figure 9: Confusion Matrix for the 21-way test using 4-grams
and reduced set of symbols. Darkest and lightest 
boxes represent accuracies of 1.0 and 0.0, respectively. 


recognized accurately are often misidentified as one of 
a handful of other languages. Hungarian, in particular, 
has false positives on speakers of Arabic, Czech, Span- 
ish, Swahili, Tamil, and Vietnamese. These same lan- 
guages are also less frequently misidentified as Brazilian 
Portuguese, Hindi, Japanese, Korean, or Mandarin. In 
future work, we plan to investigate what specific acous- 
tic features of language cause this classifier to perform so 
well on many of the languages while failing to accurately 
recognize others. 


Binary classification rates, shown in Figure 11 and 
Table 1, are similarly improved over our initial results. 
Overall, the classifier achieves over 86% accuracy when 
distinguishing between two languages. The median ac- 
curacy is 92.7% and 12% of the language pairs can be 
distinguished at rates greater than 98%. In a few cases 
like Portuguese versus Korean or Farsi versus Polish, the 
classifier exhibited 100% accuracy on our test data. 


Interestingly, the results of our classifiers are compara- 
ble to those presented by Zissman [38] in an early study 
of language identification techniques using full acous- 
tic data. Zissman implemented and compared four dif- 
ferent language recognition techniques, including Gaus- 
sian mixture model (GMM) classification and techniques 
based on single-language phone recognition and n-gram 
language modeling. All four techniques used cepstral co- 
efficients as input [22]. 


The GMM classifier described by Zissman is much 
simpler than the other techniques and serves primarily 
as a baseline for comparing the performance of the more 
sophisticated methods presented in that work. Its accu- 
racy is quite close to that of our initial classifier: with 
access to approximately 10 seconds of raw acoustic data, 
it scored approximately 78% for three language pairs, 
compared to our classifier’s 89%.

Figure 10: CDF showing how often the speaker’s language
was among the classifier’s top x choices using 4-grams
and reduced set of symbols. 

The more sophisticated classifiers in [38] have performance closer to that 
of our improved classifier. In particular, an 11-way clas- 
sifier based on phoneme recognition and n-gram lan- 
guage modeling (PRLM) was shown to achieve 89% ac- 
curacy when given 45s of acoustic data. In each case, 
our classifier has the advantage of a larger sample, using 
around 2 minutes of data. 

Naturally, current techniques for language identifi- 
cation have improved on the earlier work of Zissman 
and others, and modern error rates are almost an order 
of magnitude better than what our classifiers achieve. 
Nevertheless, this comparison serves to demonstrate the 
point that we are able to extract significant information 
from encrypted VoIP packets, and are able to do so with 
an accuracy close to a reasonable classifier with access 
to acoustic data. 


DISCUSSION 


We note that since the audio files in our corpus were 
recorded over a standard telephone line, they are sampled 
at 8kHz and encoded as 16-bit PCM audio, which is ap- 
propriate for Speex narrowband mode. While almost all 
traditional telephony samples the source audio at 8kHz, 
many soft phones and VoIP codecs have the ability to use 
higher sampling rates such as 16kHz or 32kHz to achieve 
better audio quality at the tradeoff of greater load on the 
network. Unfortunately, without a higher-fidelity data 
set, we have been unable to evaluate our techniques on 
VoIP calls made with these higher sampling rates. Nev- 
ertheless, we feel that the results we derive from using 
the current training set are also informative for higher- 
bandwidth codecs for two reasons. 

First, it is not uncommon for regular phone conver- 
sations to be converted to VoIP, enforcing the use of an 
Figure 11: CCDF for overall accuracy of the binary classifier
using 4-grams and reduced set of symbols. 


8kHz sampling rate. Our test setup accurately models the 
traffic produced under this scenario. Second, and more 
importantly, by operating at the 8kHz level, we argue 
that we work with less information about the underlying
speech, as we are only able to estimate bit rates up 
to a limited fidelity. Speex wideband mode, for example, 
operates on speech sampled at 16kHz and in VBR mode 
uses a wider range of bit rates than does the narrowband 
mode. With access to more distinct bit rates, one would 
expect to be able to extract more intricate characteristics 
about the underlying speech. In that regard, we believe 
that our results could be further improved given access to 
higher-fidelity samples. 


4.3 Mitigation 


Recall that these results are possible because the default 
mode of encryption in SRTP is to use a length-preserving 
stream cipher. However, the official standard [4] does al- 
low implementations to optionally pad the plaintext pay- 
load to the next multiple of the cipher’s block size, so that 
the original payload size is obscured. Therefore, we in- 
vestigate the effectiveness of padding against our attack, 
using several block sizes. 

To determine the packet sizes that would be pro- 
duced by encryption with padding, we simply modify 
the packet sizes we observed in our network traces by 
increasing their RTP payload sizes to the next multiple 
of the cipher’s block size. To see how our attack is af- 
fected by this padding, we re-ran our experiments us- 
ing block sizes of 128, 192, 256, and 512 bits. Padding 
to a block size of 128 bits results in 4 distinct packet 
sizes; this number decreases to 3 distinct sizes with 192- 
bit blocks, 2 sizes with 256-bit blocks, and finally, with 
512-bit blocks, all packets are the same size. Figure 12 
shows the CDF for the classifier’s results for these four 











Lang.    Acc.    |  Lang.    Acc.

EN-FA    0.980   |  CZ-JA    0.544
GE-RU    0.985   |  AR-SW    0.549
FA-SD    0.990   |  CZ-HU    0.554
IN-PO    0.990   |  CZ-SD    0.554
PO-RU    0.990   |  MA-VI    0.565
BP-PO    0.995   |  JA-SW    0.566
EN-HI    0.995   |  HU-VI    0.575
HI-PO    0.995   |  CZ-MA    0.580
BP-KO    1.000   |  CZ-SW    0.590
FA-PO    1.000   |  HU-TA    0.605














Table 1: Binary classifier recognition rates for selected 
language pairs. Languages and their abbreviations are 
listed in Appendix A. 


cases, compared to random guessing and to the results 
we achieve when there is no padding. 
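
The padding simulation described above can be sketched in a few lines. This is only an illustration of the rounding step: the payload sizes below are hypothetical values chosen for the example, not sizes taken from our traces, and the function names are our own.

```python
def pad_to_block(payload_bytes, block_bits):
    """Round a payload size up to the next multiple of the cipher block size."""
    block_bytes = block_bits // 8
    # Integer ceiling division avoids floating-point rounding.
    return -(-payload_bytes // block_bytes) * block_bytes


def distinct_padded_sizes(payload_sizes, block_bits):
    """Return the sorted set of sizes an observer would see after padding."""
    return sorted({pad_to_block(size, block_bits) for size in payload_sizes})


# Hypothetical RTP payload sizes in bytes (illustrative assumption only).
observed = [6, 10, 15, 20, 28, 38, 46, 62]

for bits in (128, 192, 256, 512):
    sizes = distinct_padded_sizes(observed, bits)
    print(f"{bits}-bit blocks: {len(sizes)} distinct sizes {sizes}")
```

With these illustrative inputs the number of distinct sizes shrinks as the block size grows, which is the effect that removes information from the packet-length channel.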


Padding to 128-bit blocks is largely ineffective because
there is still sufficient granularity in the packet
sizes that we can map them to basically the same three
bins used by our improved classifier in Section 4.2. Even
with 192- or 256-bit blocks, where dimensionality reduc- 
tion does not offer substantial improvement, the correct 
language can be identified on the first guess over 27% of 
the time—more than 5 times better than random guess- 
ing. It is apparent from these results that, for encryp- 
tion with padding to be an effective defense against this 
type of information leakage, the block size must be large 
enough that all encrypted packets are the same size. 


Relying on the cryptographic layer to protect against 
both eavesdropping and traffic analysis has a certain 
philosophical appeal because then the compression layer 
does not have to be concerned with security issues. On 
the other hand, padding incurs significant overhead in the 
number of bytes that must be transmitted. Table 2 lists 
the increase in traffic volume that arises from padding to 
each block size, as well as the improvement of the overall 
accuracy of the classifier over random guessing. 


Another solution for ensuring that there is no infor- 
mation leakage is to use a constant bit rate codec, such 
as Speex in CBR mode, to send packets of fixed length. 
Forcing the encoder to use a fixed number of bits is an 
attractive approach, as the encoder could use the bits that 
would otherwise be used as padding to improve the qual- 
ity of the encoded sound. While both of these approaches 
would detract from the bandwidth savings provided by 
VBR encoders, they provide much stronger privacy guar- 
antees for the participants of a VoIP call. 
Block Size | Overhead | Accuracy | Improvement
           |          |          | vs Random
none       |   0.0%   |  66.0%   |   13.8x
128 bits   |   8.7%   |  62.5%   |   13.0x
192 bits   |  13.8%   |  27.1%   |    5.7x
256 bits   |  23.9%   |  27.2%   |    5.7x
512 bits   |  42.2%   |   6.9%   |    1.4x

















Table 2: Tradeoff of effectiveness versus overhead in- 
curred for padding VoIP packets to various block sizes. 
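
As a rough check on the last column of Table 2, the following sketch divides each accuracy by the accuracy of random guessing. It assumes the baseline is a uniform guess over the 21 languages in the corpus (1/21); that baseline and the variable names are our assumptions for illustration. The ratios come out close to, but not always identical to, the printed values, presumably because of rounding in the reported figures.

```python
NUM_LANGUAGES = 21  # languages in the corpus
RANDOM_ACCURACY = 1 / NUM_LANGUAGES  # assumed uniform-guess baseline

# Accuracy column of Table 2, as fractions.
table2_accuracy = {
    "none": 0.660,
    "128 bits": 0.625,
    "192 bits": 0.271,
    "256 bits": 0.272,
    "512 bits": 0.069,
}


def improvement_over_random(accuracy):
    """Ratio of classifier accuracy to the assumed random-guess accuracy."""
    return accuracy / RANDOM_ACCURACY


for block_size, accuracy in table2_accuracy.items():
    print(f"{block_size:>8}: {improvement_over_random(accuracy):.2f}x")
```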


5 Related Work 


Some closely related work is that of Wang et al. [31] on 
tracking VoIP calls over low-latency anonymizing net- 
works such as Tor [9]. Unlike our analysis, which is en- 
tirely passive, the attack in [31] requires that the attacker 
be able to actively inject delays into the stream of pack- 
ets as they traverse the anonymized network. Other re- 
cent work has explored extracting sensitive information 
from several different kinds of encrypted network con- 
nections. Sun et al. [27], for example, examined World 
Wide Web traffic transmitted in HTTP over secure (SSL) 
connections and were able to identify a set of sensitive 
websites based on the number and sizes of objects in 
each encrypted HTTP response. Song et al. [26] used 
packet interarrival times to infer keystroke patterns and 
ultimately crack passwords typed over SSH. Zhang and 
Paxson [36] also used packet timing in SSH traffic to 
identify pairs of connections which form part of a chain 
of “stepping stone” hosts between the attacker and his 
eventual victim. In addition to these application-specific 
attacks, our own previous work demonstrates that packet 
size and timing are indicative of the application protocol 
used in SSL-encrypted TCP connections and in simple 
forms of encrypted tunnels [34]. 


Techniques for automatically identifying spoken
languages were the subject of a great deal of work in the
mid-1990s [18, 38]. While these works used a wide
range of features extracted from the audio data and em- 
ployed many different machine learning techniques, they 
all represent attempts to mimic the way humans differ- 
entiate between languages, based on differences in the 
sounds produced. Because our classifier does not have 
direct access to the acoustic data, it is unrealistic to ex- 
pect that it could outperform a modern language recog- 
nition system, where error rates in the single digits are 
not uncommon. Nevertheless, automatic language
identification is not considered a solved problem, even with
access to full acoustic data, and work is ongoing in the
speech community to improve recognition rates and
explore new approaches (see, e.g., [32, 8, 1]).


Figure 12: The effect of padding on classifier accuracy.


6 Conclusions 


In this paper, we show that despite efforts devoted to se- 
curing conversations that traverse Voice over IP, an ad- 
versary can still exploit packet lengths to discern con- 
siderable information about the underlying spoken lan- 
guage. Our techniques examine patterns in the output of 
Variable Bit Rate encoders to infer characteristics of the 
encoded speech. Using these characteristics, we evalu- 
ate our techniques on a large corpus of traffic from dif- 
ferent speakers, and show that our techniques can clas- 
sify (with reasonable accuracy) the language of the tar- 
get speaker. Of the 21 languages we evaluated, we are 
able to correctly identify 14 with accuracy greater than 
90%. When tasked with distinguishing between just two 
languages, our average accuracy over all language pairs 
is greater than 86%. These recognition rates are on par 
with early results from the language identification com- 
munity, and they demonstrate that variable bit rate cod- 
ing leaks significant information. Moreover, we show 
that simple padding is insufficient to prevent leakage of 
information about the language spoken. We believe that 
this information leakage from encrypted VoIP packets is 
a significant privacy concern. Fortunately, we are able to 
suggest simple remedies that would thwart our attacks. 


Acknowledgments 


We thank Scott Coull for helpful conversations throughout
the course of this research, as well as for pointing out
the linphone application [17]. We also thank Patrick
McDaniel and Patrick Traynor for their insightful comments
on early versions of this work. This work was funded in
part by NSF grants CNS-0546350 and CNS-0430338.


Notes 


1 Note that our classifier is not a true instance of a χ² classifier,
as the probability distributions over each n-gram are not independent.
Essentially, we just use the χ² function as a multi-dimensional distance
metric.

2 Due to problems with the data, recordings from the French speakers
are unavailable.
A. Data Set Breakdown


The empirical analysis performed in this paper is based 
on one of the most widely used data sets in the lan- 
guage recognition community. The Oregon Graduate In- 
stitute CSLU 22 Language corpus provides speech sam- 
ples from 2,066 native speakers of 21 distinct languages. 
Indeed, the work of Zissman [38] that we analyze in
Section 4 used an earlier version of this corpus. Table 3
provides some statistics about the data set.




















Language        | Abbr. | Speakers | Minutes of speech
Arabic          | AR    |   100    |   2.16
Br. Portuguese  | BP    |   100    |   2.52
Cantonese       | CA    |    93    |   2.63
Czech           | CZ    |   100    |   2.02
English         | EN    |   100    |   2.51
Farsi           | FA    |   100    |   2.57
German          | GE    |   100    |   2.33
Hindi           | HI    |   100    |   2.74
Hungarian       | HU    |   100    |   2.81
Indonesian      | IN    |   100    |   2.45
Italian         | IT    |   100    |   2.25
Japanese        | JA    |   100    |   2.33
Korean          | KO    |   100    |   2.58
Mandarin        | MA    |   100    |   2.75
Polish          | PO    |   100    |   2.64
Russian         | RU    |   100    |   2.55
Spanish         | SP    |   100    |   2.76
Swahili         | SW    |    73    |   2.26
Swedish         | SD    |   100    |   2.23
Tamil           | TA    |   100    |   2.12
Vietnamese      | VI    |   100    |   1.96











Table 3: Statistics about each language in our data
set [15]. Minutes of speech measures how many
minutes of speech we used during our tests.
Abstract


We analyze three new consumer electronic gadgets in 
order to gauge the privacy and security trends in mass- 
market UbiComp devices. Our study of the Slingbox Pro 
uncovers a new information leakage vector for encrypted 
streaming multimedia. By exploiting properties of vari- 
able bitrate encoding schemes, we show that a passive 
adversary can determine with high probability the movie 
that a user is watching via her Slingbox, even when the 
Slingbox uses encryption. We experimentally evaluated 
our method against a database of over 100 hours of net- 
work traces for 26 distinct movies. 

Despite an opportunity to provide significantly more 
location privacy than existing devices, like RFIDs, we 
find that an attacker can trivially exploit the Nike+iPod 
Sport Kit’s design to track users; we demonstrate this 
with a GoogleMaps-based distributed surveillance sys- 
tem. We also uncover security issues with the way Mi- 
crosoft Zunes manage their social relationships. 

We show how these products’ designers could have 
significantly raised the bar against some of our attacks. 
We also use some of our attacks to motivate fundamen- 
tal security and privacy challenges for future UbiComp 
devices. 


Keywords: Information leakage, variable bitrate (VBR) 
encoding, encryption, multimedia security, privacy, loca- 
tion privacy, mobile social applications, UbiComp. 


1 Introduction 


As technology continues to advance, computational
devices will increasingly permeate our everyday lives,
placing more and more wireless computers into our
environment and onto us. Many manufacturers have predicted
that the increasing capabilities and decreasing costs of
wireless radios will enable common electronics in
future homes to be predominantly wireless, eliminating the
clutter of wires common in today’s homes. For example,
TVs, cable boxes, speakers, and DVD players could
communicate without the proximity restrictions of wires. 
The changing technological landscape will also lead to 
new computing devices, such as personal health moni- 
tors, for us to wear on our persons as we move around 
our community. However, despite advances in these ar- 
eas we have only just begun to see the first examples of 
such technologies enter the marketplace at a broad scale. 
While the Ubiquitous Computing (UbiComp) revolution 
will have many positive aspects, we must be careful to 
not simultaneously endanger users’ privacy or security. 


By studying the Sling Media Slingbox Pro, the 
Nike+iPod Sport Kit, and the Microsoft Zune, we pro- 
vide a checkpoint of current industrial trends regarding 
the privacy and security of this new generation of Ubi- 
Comp devices. (The Slingbox Pro is a video relay sys- 
tem; the Nike+iPod Sport Kit is a wireless exercise ac- 
cessory for the iPod Nano; and the Zune is a portable 
wireless media player.) In some cases, such as our 
techniques for inferring information about what movie 
a user is watching from 10 minutes of a Slingbox Pro’s 
encrypted transmissions, we present new directions for 
computer security research. For some of our other re- 
sults, such as the Nike+iPod’s use of a globally unique 
persistent identifier, the key privacy issues that we un- 
cover are not new; but the ease with which we are able to 
mount our attacks is surprising. This is particularly true 
because we show that it would have been technically pos- 
sible for the Nike+iPod designers to prevent our attacks. 


In all cases, we use our results with these devices to
paint a set of research challenges that future commercial
UbiComp devices should address in order to provide
users with strong levels of privacy and security.


On Our Choice of Devices. The Slingbox Pro, the 
Nike+iPod Sport Kit, and the Microsoft Zune represent 
a cross-section of the different classes of UbiComp de- 
vices one might encounter in the future: (1) devices that 
permeate our environment and that stream or exchange 
Figure 1: 5-second and 100-millisecond throughput for
two traces of Ocean’s Eleven played via the Slingbox and
captured via a wired connection. Notice the (visual)
similarity between the traces.


information; (2) devices that users have on their persons 
all the time; and (3) devices that promote social interac- 
tions. While there is no perfect division between these 
different classes of devices (e.g., devices that users have 
on themselves all the time may also exchange content 
and promote social activity), there are unique aspects to 
the challenges for each class of devices; we therefore 
consider each in turn. Specifically, (1) we use the Sling- 
box Pro as a vehicle to study the issues and challenges 
affecting next-generation wireless multimedia environ- 
ments, (2) we use the Nike+iPod Sport Kit as the ba- 
sis for assessing the issues and challenges with devices 
that we have on our persons all the time, and (3) we use 
the Zune as a foothold into understanding the issues and 
challenges with devices promoting social activity. 

Below we survey our results and challenges for each 
of these scenarios in turn, deferring further details to the 
body of this paper. 


1.1 The Sling Media Slingbox Pro 


The Slingbox Pro allows users to remotely view (sling) 
the contents of their TV over the Internet. The makers of 
the Slingbox Pro are staged to introduce a new device, 
the wireless SlingCatcher, which will allow Slingbox 
users to sling video to other TVs located within the same 
home, thereby making it one of the first next-generation 
wireless video multimedia systems for the home [40]. 
Since the SlingCatcher will not be commercially avail- 
able until later this year, we choose to study the privacy- 
preserving properties of a Slingbox streaming encrypted 
movies to a nearby computer over 802.11 wireless. 

We describe in the following sections a technique for 
monitoring a network connection, wired or wireless, and 
based on the rate at which data is being sent from one 
device to the other, predicting the content that is being 
transferred. Our method consists of two parts. First, 
we describe a procedure for collecting throughput traces
across wired and wireless connections and combining
them into a single reference trace per movie. These refer- 
ence traces are collected into a database for future query 
use. (Figure 1 shows the raw 5 second and 100 mil- 
lisecond throughput data for two wired traces of Ocean’s 
Eleven.) Second, we describe a simple Discrete Fourier 
Transform based matching algorithm for querying this 
database and predicting the content being transmitted. 

We test this algorithm on a dataset consisting of over 
100 hours of network throughput data. With only 10 min- 
utes worth of monitoring data, we are able to predict with 
62% accuracy the movie that is being watched (on aver- 
age over all movies); this compares favorably with the 
less than 4% accuracy that one would achieve by random 
chance. With 40 minutes worth of monitoring data, we 
are able to predict the movie with 77% accuracy. For 
certain movies we can do significantly better; for 15 out 
of the 26 movies, given a 40 minute trace we are able to 
predict the correct movie with over 98% accuracy. Given 
the simplicity of our algorithm, this indicates a signifi- 
cant amount of information leakage — a fact that is not 
immediately obvious to the users, who likely trust the 
built in encryption in the device to protect privacy. 

Any transmission method whose characteristics de- 
pend on the content that is being transmitted is suscepti- 
ble to the kind of attack we have described. As the world 
moves towards more advanced multimedia compression 
methods, and streaming media becomes ubiquitous, vari- 
able bitrate encoding is here to stay. Preventing informa- 
tion leakage in variable bitrate streams without a signif- 
icant performance penalty is an interesting challenge for 
both the signal processing and the security communities. 
More broadly, a fundamental challenge that we must ad- 
dress is how to identify, understand, and mitigate infor- 
mation leakage channels in the full range of upcoming 
UbiComp devices. 


1.2 The Nike+iPod Sport Kit 


The Nike+iPod Sport Kit is a new wireless accessory for 
the iPod Nano; see Figure 2. The kit consists of two 
components — a wireless sensor that a user puts in one 
of her shoes and a receiver that she attaches to her iPod 
Nano. When the user walks or runs, the sensor wire- 
lessly transmits information to the receiver; the receiver 
and iPod will then interpret that information and provide 
interactive audio feedback to the user about her work- 
out. The Nike+iPod sensor does have an on-off button, 
but the online documentation suggests that most users 
should leave their sensors in the on position. Moreover, 
since the Nike+iPod online documentation encourages
users to “just drop the sensor in their Nike+ shoes and
forget about it [36],” the Nike+iPod Sport Kit is a prime
example of the types of devices that people might
eventually find on themselves all the time.


Figure 2: A Nike+iPod sensor in a Nike+ shoe and a
Nike+iPod receiver connected to an iPod Nano.


One well-known potential privacy risk with having 
wireless devices on ourselves all the time is: if the de- 
vices use unique identifiers when they communicate, and 
if someone can intercept (sniff) those unique identifiers 
from the communications, then that someone might learn 
potentially private information about a user’s presence 
or location. This someone might use that information in 
ways that are not in a user’s best interest; e.g., a stalker 
might use this information to digitally track one or many 
people, a company might use this information for tar- 
geted advertising, and a court might examine this infor- 
mation when debating a contentious case. Location and 
tracking issues such as these are broadly discussed in the 
context of RFIDs [27], Bluetooth devices [26, 44], and (to
a lesser extent) 802.11 wireless devices [15], and there is 
a large body of UbiComp literature focused on privacy in 
location-aware systems [5, 11, 12, 19, 20, 25, 22, 29, 34]. 
Given this broad awareness of the potential trackability
issues with wireless devices, and given media reports
that the Nike+iPod Sport Kit used a proprietary wireless
protocol [35], we set out to determine whether the new
Nike+iPod Sport Kit proprietary system “raised the bar”
against parties wishing to track users’ locations.


We describe the technical process that we went 
through in order to discover the Nike+iPod Sport Kit 
protocol in Section 3. Our key discovery is
that not only does each Nike+iPod sensor have a glob- 
ally unique identifier, but we can cheaply and easily de- 
tect the transmissions from the Nike+iPod shoe sensors 
from 10-20 meters away — an order of magnitude fur- 
ther than what one would expect from a wireless de- 
vice that only needs to communicate from a user’s shoe 
to the user’s iPod (typically strapped around the user’s 
arm), and also significantly further than the conventional 
passive RFID. The Nike+iPod sensor also broadcasts its 
unique identifier even when there are no iPods nearby 
— the user must simply be moving with a Nike+iPod 
sensor in her shoe. To illustrate the ease with which
one could create a Nike+iPod tracking system, we
developed a network of Nike+iPod surveillance devices,
including a $250 gumstix-based node. The gumstix uses
an 802.11 wireless Internet connection to dynamically
stream surveillance data to our back-end server, which
then displays the surveillance data in a GoogleMaps
application in real time.


Figure 3: (a) A gumstix-based Nike+iPod surveillance
device with wireless Internet capabilities. (b) Our
Nike+iPod Receiver to USB adapter.

We then describe cryptographic mechanisms that, if 
implemented, would have significantly improved the 
Nike+iPod Sport Kit’s resistance to our tracking attacks, 
albeit with the potential drawback of additional resource 
consumption (e.g., battery life and communication over- 
head). Our basic approach is to mask the unique iden- 
tifiers so that only the intended recipient can unmask 
them. Our solution, however, exploits the fact that
the Nike+iPod Sport Kit has a very simplistic
communications topology: at any given time a Nike+iPod
Sport Kit sensor only needs to be able to communicate
with one receiver. The challenge is therefore to lift our
privacy-preserving mechanisms (or other mechanisms)
to a broader context with heterogeneous devices
communicating in an ad hoc manner.
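
To make the idea of masking an identifier concrete, here is a minimal sketch of one way it could be done. It is not the mechanism developed later in this paper. It assumes that the sensor and its single receiver share a secret key established at pairing time, and the function names, the 16-byte nonce, and the use of HMAC-SHA256 as a pad generator are all illustrative assumptions. The sketch also omits message authentication.

```python
import hashlib
import hmac
import os

NONCE_LEN = 16  # bytes; an arbitrary choice for this sketch


def _pad(shared_key, nonce, length):
    """Derive a one-time pad from the shared key and a fresh nonce."""
    return hmac.new(shared_key, nonce, hashlib.sha256).digest()[:length]


def mask_id(shared_key, sensor_id):
    """Sensor side: broadcast a fresh nonce plus the XOR-masked identifier."""
    if len(sensor_id) > hashlib.sha256().digest_size:
        raise ValueError("identifier longer than one HMAC output")
    nonce = os.urandom(NONCE_LEN)
    pad = _pad(shared_key, nonce, len(sensor_id))
    return nonce + bytes(a ^ b for a, b in zip(sensor_id, pad))


def unmask_id(shared_key, message):
    """Receiver side: recompute the pad from the nonce and strip it off."""
    nonce, masked = message[:NONCE_LEN], message[NONCE_LEN:]
    pad = _pad(shared_key, nonce, len(masked))
    return bytes(a ^ b for a, b in zip(masked, pad))


key = os.urandom(32)          # hypothetical pairing secret
uid = b"\x01\x02\x03\x04"     # hypothetical 4-byte sensor identifier
first, second = mask_id(key, uid), mask_id(key, uid)
print(first != second, unmask_id(key, first) == uid)
```

Because each broadcast uses a fresh nonce, two transmissions from the same sensor look unrelated to an eavesdropper who lacks the key, while the paired receiver recovers the identifier. The extra nonce bytes are an example of the communication overhead mentioned above.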


1.3 The Microsoft Zune 


The Microsoft Zune is a portable digital media player 
with one (currently) unusual feature: built in 802.11 
wireless capabilities. The intended goal is to let users 
wirelessly share pictures and songs with other nearby 
Zunes — including Zunes belonging to total strangers. 
As such, the Zune is arguably the first major commer- 
cial device with the design goal of helping catalyze ad 
hoc social interactions in a peer-to-peer wireless envi- 
ronment. (Strictly speaking, we have not read the Zune 
design documents. Rather, we are inferring this design 
goal from articles in the popular press and from other 
publicly available information about the Zune [32].) 
Unfortunately, just as it is possible for spammers to 
send unsolicited or inappropriate emails to users, it is 
possible for an attacker to beam unsolicited content to 
a nearby Zune. This unsolicited content may be annoy- 
ing, such as advertisements or propaganda, or malicious, 
such as images or songs that might make the recipient 
feel uncomfortable or unsafe.

Given the Zune’s goal of enabling ad hoc interactions, 
the Zune cannot fall back on traditional mechanisms for 
preventing unsolicited content, such as buddy lists for in- 
stant messaging. Further, much of the research on social 
interactions for ubiquitous devices is restricted to scenar- 
ios where users have a hierarchy of social relationships 
(e.g., friends and non-friends) [22], which is incompati- 
ble with the assumed Zune design goals. Rather, in ap- 
parent anticipation of such unsolicited content, the Mi- 
crosoft Zune allows users to “block” a particular device 
— a malicious individual might be able to get a user to 
accept an image or song once, but the recipient should 
be able to block the originating device from ever send- 
ing the user other content in the future. Unfortunately, 
we find that it is easy for an adversary to subvert this 
blocking mechanism, thereby allowing the adversary to 
repeatedly initiate content pushes to the victim until the 
victim walks out of range or turns off the wireless in her 
Zune. While we describe techniques that would address 
the above scenario in the particular case of the Zune, 
the observations we make underscore two challenges for 
UbiComp devices designed to enable ad hoc social in- 
teractions: (1) how to technically implement a blocking 
procedure or proactively protect against undesired con- 
tent, especially among a set of heterogeneous devices, 
and (2) how to balance the blocking mechanisms with 
our desire to protect location privacy and avoid certain 
uses of globally unique identifiers. 


1.4 Organization and Remarks 


We respectively discuss our analyses of the Slingbox Pro, 
the Nike+iPod Sport Kit, and the Microsoft Zune, as well 
as the associated research challenges, in Sections 2, 3, 
and 4. We discuss related work in-line. 

We stress that there is no evidence that Sling Media, 
Apple, Nike, or Microsoft intended for any of these de- 
vices to be used in any malicious manner. Neither Sling 
Media, Apple, Nike, nor Microsoft endorsed this study. 


2 The Slingbox Pro: Information Leakage 
and Variable Bitrate Encoding 


Although the future of home entertainment is somewhat 
fuzzy, many companies have predicted the future home 
to be a wireless one. Wireless devices tend to be easier
to install (though not necessarily easier to set up),
provide the user with more flexibility, allow the devices to
interoperate with other technologies, and reduce clutter
from wires. While it is currently easier to simply plug
these devices in once and forget about them, future
wireless technologies promise ever-increasing bandwidth
and range along with decreasing manufacturing costs,
making them more appealing and more likely to be
included in future products. Consider, for example, the
buzz associated with the upcoming SlingCatcher and the 
Apple TV; the former is expected to feature integrated 
wireless support; the latter currently does. In addition to 
the drive for devices to be connected together, wirelessly, 
in the home, these devices are often finding themselves 
networked together and connected to the Internet. 


Protecting our private information becomes increas- 
ingly difficult as we begin to continually use more wire- 
less devices. Devices in our homes could leak pri- 
vate information to wireless eavesdroppers or, when us- 
ing home devices over the Internet, wired eavesdrop- 
pers. We have investigated one such new wireless/remote 
TV viewing application — the Slingbox Pro — from a 
privacy standpoint. In doing so we have uncovered a 
new information leakage vector for encrypted multime- 
dia systems via variable bitrate encoding. 


2.1 Slingbox Pro Description 


The Slingbox Pro is a networked video streaming
device built by Sling Media, Inc. It is capable of
streaming video using its built-in TV tuner or one of four
inputs connected to DVD players, cable TV, personal video
recorders, etc., and of controlling these
devices using an IR emitter. The device itself has no
hard drive and cannot store media locally, relying on the 
connected devices to provide the video and audio con- 
tent. Paired with player software, called SlingPlayer, the 
user can watch video streamed by the Slingbox Pro on 
their laptop, desktop, or PDA anywhere they have Inter- 
net access. To accommodate limited network connec- 
tions when watching videos over a wireless network or 
away from home, the Slingbox Pro re-encodes the video
stream using a variable bitrate encoder, likely an
optimized version of Windows Media 9’s VC-1
implementation [41]. The Slingbox Pro provides encryption for its
data stream (regardless of any transport encryption like 
WPA). To avoid any problems caused by latency or net- 
work interruption the SlingPlayer will cache a buffer of 
several seconds worth of video. Because of this caching 
behavior and commonly used packet sizes for TCP pack- 
ets, the data packets from the Slingbox Pro tend to always 
be large data packets of similar size or small (seemingly 
control) packets. 

Sling Media recently announced a new device, the 
wireless SlingCatcher, which users can attach to their 
TVs. The SlingCatcher would allow users to wirelessly 
stream content from a Slingbox Pro to their TVs, thereby 
taking us one step further to a wireless multimedia home. 
Since the SlingCatcher is not yet commercially available, 
we choose to study the Slingbox Pro in isolation. 
Table 1: Mapping from movie names to movie indices.


2.2 Experimental Setup 


We ask whether Slingbox’s use of encryption prevents 
an eavesdropper from discovering what content is being 
transmitted. This private information could be poten- 
tially sensitive if the content is illegal (e.g., pirated), em- 
barrassing, or is otherwise associated with some social 
stigma. Toward answering this question, we conducted 
the following experiments. 

We streamed a total of 26 movies from a Slingbox Pro
to laptop and desktop Windows XP computers running
the Sling Media SlingPlayer. See Table 1. For each movie
we streamed the first hour of the movie twice over a
wired connection and twice over an 802.11g WPA-PSK
TKIP wireless connection. Each time we used the
Wireshark protocol analyzer [43] to capture all of the
Slingbox encrypted packets to a file. We split each of these
traces into 100-millisecond segments and calculated the
data throughput for each segment. We use these 100-
millisecond throughput traces as the basis for our eaves- 
dropping analysis. See Figure 1 for two examples of 
these 100-millisecond traces, as well as two example 5- 
second throughput traces. 
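
The conversion from a packet capture to a throughput trace can be sketched as follows. This is a simplified illustration rather than our actual tooling: it assumes the capture has already been exported as (timestamp, size) pairs with integer millisecond timestamps, and the function name is hypothetical.

```python
def throughput_trace(packets, bin_ms=100):
    """Sum packet sizes into consecutive bins of bin_ms milliseconds.

    packets: iterable of (timestamp_ms, size_bytes) pairs with integer
    timestamps (an assumption of this sketch). Returns bytes per bin,
    starting at the first packet's timestamp.
    """
    packets = sorted(packets)
    if not packets:
        return []
    start = packets[0][0]
    n_bins = (packets[-1][0] - start) // bin_ms + 1
    trace = [0] * n_bins
    for timestamp_ms, size_bytes in packets:
        trace[(timestamp_ms - start) // bin_ms] += size_bytes
    return trace


# Four hypothetical packets: three large data packets and one small one.
capture = [(0, 1460), (40, 1460), (150, 1460), (310, 60)]
print(throughput_trace(capture))  # [2920, 1460, 0, 60]
```

Passing bin_ms=5000 to the same function yields the coarser 5-second view.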


2.3 Throughput Analyses


Our eavesdropping algorithm consists of two parts. In 
the first part, we construct a database of reference traces. 
Each movie was represented by exactly one reference 
trace obtained by combining all the throughput traces 
corresponding to it. Each reference trace requires ap- 
proximately 600 kilobytes of storage per hour of video. 
The second part of our algorithm uses this database of 
reference traces to match against a previously unseen 
trace. In the following we describe each of these two 
stages in detail. 


Building a Database of Reference Traces. While it is
possible to use our matching algorithm against individual
raw traces, combining the raw traces for a movie into
one reference trace reduces the time complexity of the
matching process and increases the statistical robustness
of the matching procedure by eliminating noise and net- 
work effects peculiar to a particular trace. 

For each movie, all its traces were temporally aligned 
with each other. This is needed because the trace cap- 
turing process was started manually and the traces could 
be offset in time by 0 to 20 seconds. The alignment was 
done by looking at the maximum of the normalized cross 
correlation between smoothed versions of the traces. The 
smoothing was performed using Savitzky-Golay filtering
of degree 2 and window size 300. These filters perform
smoothing while preserving high frequency content bet- 
ter than standard averaging filters [38]. The reference 
trace was obtained by averaging over the aligned raw sig- 
nals. 
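
The alignment and averaging steps can be sketched as below. For brevity the sketch correlates the raw traces directly and omits the Savitzky-Golay smoothing that we apply before correlating; the function names and the toy data are illustrative assumptions.

```python
import math


def normalized_xcorr(a, b, lag):
    """Normalized correlation between a[i] and b[i + lag] over their overlap."""
    pairs = [(a[i], b[i + lag]) for i in range(len(a)) if 0 <= i + lag < len(b)]
    if len(pairs) < 2:
        return float("-inf")
    mean_a = sum(x for x, _ in pairs) / len(pairs)
    mean_b = sum(y for _, y in pairs) / len(pairs)
    num = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    den = math.sqrt(sum((x - mean_a) ** 2 for x, _ in pairs)
                    * sum((y - mean_b) ** 2 for _, y in pairs))
    return num / den if den > 0 else float("-inf")


def best_lag(a, b, max_lag):
    """Lag (in samples) at which b best lines up with a."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: normalized_xcorr(a, b, lag))


def reference_trace(traces, max_lag):
    """Align every trace to the first one and average the aligned samples."""
    base = traces[0]
    lags = [0] + [best_lag(base, t, max_lag) for t in traces[1:]]
    reference = []
    for i in range(len(base)):
        values = [t[i + lag] for t, lag in zip(traces, lags)
                  if 0 <= i + lag < len(t)]
        reference.append(sum(values) / len(values))
    return reference


# Toy example: the second trace is the first one delayed by two samples.
base = [0, 0, 5, 9, 5, 0, 0, 3, 0, 0]
delayed = [0, 0] + base[:-2]
print(best_lag(base, delayed, max_lag=3))
print(reference_trace([base, delayed], max_lag=3))
```

With 100-millisecond samples, an offset of up to 20 seconds corresponds to a max_lag of 200.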


Matching a Query Trace to the Database. Given a 
database of reference traces and a short throughput trace, 
we are now faced with the task of finding the best match- 
ing reference trace. This is an instance of the problem 
of subsequence matching in databases, which has been 
widely studied in both discrete and continuous domains. 
Our algorithm is inspired by the work of Faloutsos et 
al. [13]. 

The simplest approach to subsequence matching in 
timeseries is to calculate the Euclidean distance between 
the query sequence and all contiguous subsequences of 
the same size in the database. Due to the amount of noise 
present in these traces, this method does not perform well 
in practice. Following Faloutsos et al., instead of com- 
paring raw throughput values, we first extract noise tol- 
erant features from the traces and then compare subse- 
quences based on these features. 

A number of feature extraction schemes have been 
proposed for this task in the literature, including the Dis- 
crete Fourier Transform (DFT) and the Discrete Wavelet 
Transform. We use the DFT in our experiments. Each 
point in a throughput sequence was replaced by the first 
f DFT coefficients of window size w centered on that 
point. Thus each reference trace in the database was 
a sequence of non-negative throughput values was re- 
placed by a sequence of f-dimensional Fourier coeffi- 
cients. The low order Fourier coefficients capture the 
dominant low frequency behavior in each window. We 
treat the higher frequency components as noise and ig- 
nore them. The same transformation is applied to the 
query trace. The resulting f-dimensional query trace
is compared with all subsequences of the same length in the
database. The movie with the closest matching subsequence
is declared a match. Figure 4 illustrates the
database construction and matching process. 
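
The feature extraction and exhaustive matching described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the function names are hypothetical, windows are indexed by their starting sample rather than their center (which only shifts the indices and drops the edges), the complex coefficients are compared with a squared Euclidean distance, and the DFT is computed directly rather than with an FFT.

```python
import cmath


def dft_features(trace, w, f):
    """Replace each length-w window by its first f DFT coefficients."""
    feats = []
    for start in range(len(trace) - w + 1):
        window = trace[start:start + w]
        feats.append([
            sum(x * cmath.exp(-2j * cmath.pi * k * n / w)
                for n, x in enumerate(window)) / w
            for k in range(f)
        ])
    return feats


def min_sliding_distance(query_feats, ref_feats):
    """Smallest squared distance between the query and any
    equally long contiguous subsequence of the reference."""
    m = len(query_feats)
    best = float("inf")
    for off in range(len(ref_feats) - m + 1):
        d = sum(abs(q - r) ** 2
                for qv, rv in zip(query_feats, ref_feats[off:off + m])
                for q, r in zip(qv, rv))
        best = min(best, d)
    return best


def match(query, database, w, f):
    """Return the name of the reference trace closest to the query.

    database maps a movie name to its reference throughput trace.
    """
    qf = dft_features(query, w, f)
    return min(database,
               key=lambda name: min_sliding_distance(
                   qf, dft_features(database[name], w, f)))
```

In a real deployment the reference features would be computed once and stored, rather than recomputed for every query as in this sketch.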
Figure 4: Database construction and query matching. The raw throughput traces corresponding to a movie are aligned
and averaged to produce a single composite trace. A windowed Fourier transform is performed on the composite trace 
and the first f = 2 coefficients are kept. A database of movie signatures is constructed in this manner. A query trace 
is transformed similarly into a signature, and the minimum sliding window distance between the movie signatures and 
the query signature is calculated. The movie with the minimum distance is declared a match. 


We note that exhaustive matching of all subsequences 
would not be computationally feasible in a production 
environment with thousands of reference traces. Methods
based on approximate nearest neighbor searching
can be used to substantially accelerate the matching pro- 
cess without a significant loss in accuracy [13]. 


Experiments. The above algorithm has two parameters:
the size w of the sliding window used to extract
the features, and the number of Fourier features f
extracted from each window. Both affect the recognition
performance of the algorithm. Small values of w and
f result in high noise sensitivity, and large values result
in over-smoothing of the data. The other factor that affects
recognition performance is the length l of the query
trace. To choose a good parameter setting, we studied
the behavior of the algorithm described above for varying
values of w = [100, 300, 600] and f = [1, 2, 4]. For each
setting of the parameters, a random query trace of length
l = 6000 was extracted from one of the raw throughput
traces and compared using the matching algorithm described
above. This procedure was repeated 100 times
for every parameter setting. The highest accuracy was
obtained for w = 100 and f = 2, or a sliding window of
10 seconds with two Fourier coefficients per window.


We now fix the parameters w = 100 and f = 2, vary
l = [6000, 12000, 18000, 24000] (10, 20, 30, and 40
minutes), and estimate the prediction accuracy of the
eavesdropping algorithm. This is done by choosing one
throughput trace at a time, constructing the reference 
trace database using the rest of the throughput traces 
and then counting how many times random subsequences 
from the chosen trace result in an incorrect prediction. 
The average number of incorrect matches over all traces
is the leave-one-out error [18]. In our experiment, 50 random
subsequences were chosen from each trace. Sometimes
a good shortlist of possible matches is also useful,
where the list can be further trimmed with side information,
for example, the cable schedule for the area. To account
for this possibility, we count not only the number
of times the best match is correct but also, for varying
values of k = 1,...,5, the number of times the algorithm
correctly ranks the movie among the top k matches.
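
The top-k scoring can be made concrete with a small Python sketch. The function names and the input layout (one ranked list of movie names per query, best match first) are assumptions made for this illustration. The chance baseline assumes that a random guess orders the movies uniformly at random, in which case the true movie lands among the top k of n movies with probability k/n.

```python
def top_k_accuracy(rankings, truths, k):
    """Fraction of queries whose true movie is among the top k matches."""
    hits = sum(1 for ranked, truth in zip(rankings, truths)
               if truth in ranked[:k])
    return hits / len(truths)


def chance_top_k(num_movies, k):
    """Top-k accuracy of a uniformly random ranking of num_movies movies."""
    return k / num_movies
```

For example, with 26 movies the chance baseline under this assumption is 1/26 for k = 1 and 5/26 for k = 5.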

Table 2 reports the overall accuracy (1-error) of the
algorithm, where the accuracy (true positive rate) was
computed over all 26 movies. (We define the true positive
rate of a movie M as the rate at which a random
query trace for movie M is correctly identified as movie
M; we define the false positive rate of a movie M as the
rate at which a random query trace for a movie M' ≠ M
is incorrectly identified as movie M.)

For 10- and 40-minute queries, the overall accuracy 
rates are respectively 62% and 77%. Table 3 and Fig- 
ures 5 and 6 show that the accuracy rate for individual 
movies can be significantly higher. From Table 3, 15 
of our 26 movies had > 98% true positive rates for 40- 
Figure 5: Confusion matrices for: (a) 10 minute probes from both wired and wireless traces; (b) 40 minute probes from
both wired and wireless traces; (c) 40 minute probes from wired traces; (d) 40 minute probes from wireless traces.
The color scale is on the right; black corresponds to 1.0 and white corresponds to 0.0. 
Table 2: Overall accuracy of the eavesdropping algorithm.
The rows correspond to 10, 20, 30, and 40 minute
query traces, and the columns report the success with
which the algorithm correctly placed the movie in the top
k matches. The bottom row corresponds to the probability
of a match by random chance.


minute traces with k = 1, and 22 of our 26 movies had 
< 1% false positive rates for our 40-minute traces with 
k = 1.

Figures 5 (a) and (b) show the confusion matrices for
10 and 40 minute query traces with k = 1. The shade
of the cell in row i, column j denotes the rate at which
the i-th movie is identified as the j-th movie; the cells on
the diagonal correspond to correct identifications. Contrasting
Figures 5 (a) and (b) visually shows the increase
in accuracy as the length of the query trace increases.
Our wireless traces have a higher level of noise as com- 
pared to our wired traces. Figures 5 (c) and (d) there- 
fore show the confusion matrices for when the query 
is restricted to (c) wired and (d) wireless traces. Note 
that a few movies were misidentified as Caddyshack, 
as represented by the vertical band most visible in Fig- 
ure 5 (c); this is likely due to the fact that the bitrate 
for Caddyshack was fairly constant and the misidentified 
movies had significant noise (e.g., the wireless traces for 
Austin Powers 1 had significant noise, which influenced
the composite reference trace and therefore the ability of
the Austin Powers 1 query trace to match to the reference
trace). 
Table 3: True and false positive rates for 10 and 40
minute probes of both wired and wireless traces. The
true positive rate of a movie M is the rate at which an
n-minute query of that movie is correctly identified as
movie M. The false positive rate of a movie M is the
rate at which an n-minute query of some other movie
M' ≠ M is incorrectly identified as M.





USENIX Association 


16th USENIX Security Symposium 




Figure 6: Accuracy per movie for 40 minute query
traces; k = 1 through k = 5.


2.4 Limitations, Implications, and Challenges


While our experiments were conducted in a laboratory 
setting, they do reflect some possible configurations that 
one might encounter in a future home equipped with 
many wireless multimedia devices. The implications of 
our results are, therefore, that an adversary in close proximity
to a user’s home might be able to infer information
about what videos a user is watching. This adversary
might be a nosy neighbor. Or the adversary might
be someone sitting outside in a van, looking to collect
forensic evidence about those viewing “illegal” (e.g.,
censored or pirated) content. Moreover, a content pro- 
ducer (such as the creator of a movie) could intention- 
ally construct its movies to have stronger, more distinc- 
tive fingerprints. This situation would seem to violate 
the user’s perception of privacy within their own home, 
especially given the Slingbox Pro’s use of encryption. 
More broadly, our Slingbox results provide further ev- 
idence that encryption alone cannot fully conceal the 
contents of encrypted data. Other results show that one 
can infer the origins of encrypted web traffic or infer 
application protocol behaviors from encrypted data [30, 
45]. Concurrent with this work, Wright et al. show how 
variable bitrate encodings can reveal the language spo- 
ken through an encrypted VoIP connection [46]. Pro- 
tecting against such information leakage vectors for all 
possible applications seems to be a fundamental challenge.
Indeed, it may be difficult to simultaneously preserve
desirable properties like low latency and low bandwidth
consumption while also allowing for applications
with bursty or otherwise data-dependent communication
properties. As a concrete example, while it may be possible
to significantly raise the bar against information leakage
through the Slingbox by having the Slingbox push
data at a constant rate while a user is watching a movie, 
a passive eavesdropper may still be able to learn when a 
user watches movies, and for how long. The challenge, 
therefore, is to first determine the possible information 
leakage vectors, understand their implications, and de- 
velop technical means for mitigating them. 


3 The Nike+iPod Sport Kit: Devices that
Reveal Your Presence


The Nike+iPod Sport Kit foreshadows the types of 
application-specific UbiComp devices that we might 
soon find ourselves wearing as part of our daily routine. 
Indeed, based on publicly available information about 
the intended usage of the Nike+iPod Sport Kit, as well 
as our own personal observations, we expect that many 
Nike+iPod users will always leave their Nike+iPod sen- 
sors turned on and in their shoes. 

We describe here the steps we took to discover the 
Nike+iPod protocol; our goal was to assess whether the 
Nike+iPod Sport Kit provides protection mechanisms 
against an adversary who wishes to track users’ loca- 
tions. Having uncovered no such protection mechanisms, 
we then describe our subsequent steps to gauge how easy 
and cheap it might be for an adversary to implement our 
attacks. Finally, we consider fixes to the Nike+iPod protocol
as well as some broader research challenges that
our results raise. 


3.1 Nike+iPod Description 


The Nike+iPod Sport Kit allows runners and walkers to 
hear real-time workout progress reports on their iPod
Nanos. A typical user would purchase an iPod Nano, 
a Nike+iPod Sport Kit, and either a pair of Nike+ shoes 
or a special pouch to attach to non-Nike+ shoes. The 
kit consists of a receiver and a sensor. Users place the 
sensor in their left Nike+ shoe and attach the receiver 
to their iPod Nano as shown in Figure 2. The sensor is 
a 3.5cm x 2.5cm x 0.75cm plastic encased device, and 
the receiver is a 2.5cm x 2cm x 0.5cm plastic encased 
device. When a person runs or walks, the sensor begins
to broadcast sensor data via a radio transmitter, whether
or not an iPod Nano is present. When the person stops
running or walking for ten seconds, the sensor goes to 
sleep. When the iPod Nano is in workout mode and the 
receiver’s radio receives sensor data from the sensor, the 
receiver will relay (a function of) that data to the iPod 
Nano, which will then give audio feedback (via the iPod 
headphones) to the person about his or her workout. As 
of September 2006, Apple has sold more than 450,000 
of the $29 (USD) Nike+iPod Sport Kits [1]. 

3.2 Discovering the Nike+iPod Protocol


Initial Analysis. The first step was to learn how the 
Nike+iPod sensor communicates with the receiver. Ac- 
cording to the Nike+iPod documentation, a sensor and 
receiver need to be linked together before use; this link- 
ing process involves user participation. Once linked, the 
receiver will only report data from that specific sensor, 
eliminating the readings from other nearby sensors. The 
receiver can also remember the last sensor to which it 
was linked so that users do not need to perform the link- 
ing step every time they turn on their iPods. The receiver 
can also later be linked to a different sensor (for a re- 
placement sensor or different user), but under the stan- 
dard user interface the receiver can only be linked to one 
sensor at any given time. 

We observed, however, that a single sensor could be 
linked to two receivers simultaneously, meaning that two 
people could use their iPod Nanos and the standard user 
interface to read the data from a single Nike+iPod sen- 
sor at the same time. Further investigation revealed that 
the sensor was a transmitter only, meaning that it was 
incapable of knowing what iPod or receiver it was as- 
sociated with. This observation provides the underlying 
foundation for our results since it concretely shows that 
a Nike+iPod Sport Kit does not enforce a strong, exclu- 
sive, one-to-one binding between a sensor and a receiver. 
Having made this observation, we then commenced to 
uncover more details about the Nike+iPod protocol. 


The Hardware, Serial Communications, and Unique 
Identifiers. The Nike+iPod Sport Kit receiver commu- 
nicates with the iPod Nano through the standard iPod 
connector. Examining which pins are present on the re- 
ceiver’s connector and comparing those pins with online 
third-party pin documentation [24], we determined that 
communication was most likely being done over a serial 
connection. 

Opening the white plastic case of the receiver reveals 
a component board and the pin connections to the iPod 
connector. There are ten pins in use; three of these pins 
are used in serial communication: ground, iPod transmit, 
and iPod receive. Using an oscilloscope, we verified that
digital data was being sent across this serial connection.
We then soldered wires to these pins, connecting them to
the serial port of our computer. With the receiver connected
to the iPod, we turned on the iPod and observed data sent in both
directions over the serial connection.

As noted above, before the receiver can be used with a 
new sensor, the sensor must be linked with the receiver. 
This is initiated by the user through menus in the iPod 
interface. The user is asked to walk around so that the 
sensor can be detected by the receiver. When the link 
process is started, the iPod sends some data to the receiver.
Then, the receiver begins sending data back to
the iPod until the new sensor is discovered and linked by
the receiver. Finally, the iPod sends some more data back 
to the receiver. 


After collecting and comparing several traces of the
link process with several different sensors, we noticed
that linking seemed to complete when the third occur- 
rence of a certain packet came from the receiver. These 
packets’ payload started with the same four bytes; how- 
ever, the next four bytes were different depending on 
which sensor we used. In all our experiments these four 
bytes appear to be consistent and unique for a single sen- 
sor, and therefore we refer to these four bytes as the sen- 
sor’s unique identifier or UID. As further corroboration 
for the uniqueness of these UIDs, we find that we can 
use the iPod Nano as an oracle for translating between 
the UIDs and the Nike+iPod sensor’s serial number as it 
appears on the back of the sensor; we omit details but 
instead refer the reader to Figure 7 for a sketch of how 
one might use an iPod Nano as a UID to serial number 
oracle. As suggested above, the Nike+iPod Sport Kit ap- 
pears to use these UIDs for addressing purposes — after 
linking, a receiver will only report packets containing the 
specified UID. 
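
The observation above suggests a simple way to pull UIDs out of captured link-mode payloads, sketched below in Python. This is a hypothetical illustration only: the four-byte prefix value LINK_PREFIX is a placeholder (the actual bytes are not given here), the function names are invented for this example, and treating the third occurrence of a packet with the same UID as the end of linking is our reading of the traces rather than a documented protocol rule.

```python
# Placeholder value; the real four-byte prefix is not specified in the text.
LINK_PREFIX = bytes.fromhex("01020304")


def extract_uid(payload, prefix=LINK_PREFIX):
    """Return the four bytes following the prefix, or None if the
    payload does not look like a link packet."""
    if len(payload) < len(prefix) + 4 or not payload.startswith(prefix):
        return None
    return payload[len(prefix):len(prefix) + 4]


def link_complete_uid(payloads, needed=3):
    """Return the first UID seen `needed` times, mimicking the observation
    that linking completes on the third occurrence of the packet."""
    counts = {}
    for payload in payloads:
        uid = extract_uid(payload)
        if uid is None:
            continue
        counts[uid] = counts.get(uid, 0) + 1
        if counts[uid] == needed:
            return uid
    return None
```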


Automatically Discovering UIDs. Our next step was 
to use the Nike+iPod receiver to listen for sensor UIDs 
in an automated fashion without the iPod Nano. To do 
this we modified an iPod female connector by soldering 
wires from the serial pins on the iPod connector to our 
adapter, adjusted the voltage accordingly, and attached 
3.3V power to the power pin. We then plugged an un- 
modified Nike+iPod receiver into our female connector 
and replayed the data that we saw coming from the iPod 
when the iPod is turned on and then when the iPod en- 
ters link mode. This process caused the receiver to start 
sending packets over the serial connection to our com- 
puter with the identifiers of the broadcasting sensors in 
range. However, because our computer never responds 
to the receiver’s packets, the link process never ends and 
the receiver continues to send to our computer the iden- 
tifiers of transmitting sensors until power is removed. 


Implications. Our observations here immediately imply 
that the Nike+iPod Sport Kit may leak private information
about a user’s location. Namely, as is well known
in the context of other devices (like RFIDs and discoverable
Bluetooth devices [26, 27, 44]), if a wireless device
broadcasts a persistent globally unique identifier, an attacker
with multiple wireless sniffers can correlate the
location of that device (and by inference the user) across 
different physical spaces and over time. 
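
To make the correlation step concrete, the following Python sketch groups sightings from several sniffers into one movement track per UID. The record layout (timestamp, sniffer name, UID) and the function name are assumptions made for this illustration.

```python
from collections import defaultdict


def tracks_by_uid(sightings):
    """Group (timestamp, sniffer, uid) sightings into per-UID tracks.

    Each track lists (timestamp, sniffer) pairs in time order, keeping
    only the first sighting of each consecutive stay at one sniffer.
    """
    tracks = defaultdict(list)
    for ts, sniffer, uid in sorted(sightings):
        if not tracks[uid] or tracks[uid][-1][1] != sniffer:
            tracks[uid].append((ts, sniffer))
    return dict(tracks)
```

Because the identifier is persistent, no further linking logic is needed: every sighting with the same UID belongs to the same sensor, and by inference to the same person.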
Figure 7: The figure on the left shows our approach for passively monitoring the serial communications between an
iPod and the Nike+iPod receiver; the communications between the iPod and the receiver are over a physical, serial
connection, and the communication from the sensor to the receiver is via a radio. The figure in the middle shows our
approach for directly controlling a Nike+iPod receiver from a computer; the communication from the computer to the
Nike+iPod receiver is over a physical serial connection. The figure on the right shows our approach for translating
between a sensor’s UID and the sensor’s serial number.


3.3 Measurements


To understand the implications of our observations in 
Section 3.2, we must understand the following proper- 
ties of a Nike+iPod sensor: when it transmits; how often 
it transmits; the range at which the receiver hears the sen- 
sor’s UID; and the collision behavior of multiple sensors. 
We have already partially addressed some of these prop- 
erties, but elaborate on our observations here. 


When the sensor is still, it is “sleeping” to save bat- 
tery. When one begins to walk or run with the sensor in 
their shoe, the sensor begins transmitting. It is also pos- 
sible to wake up the sensor without putting it in a shoe. 
For example, shaking the sensor while still in the sealed 
package from the store will cause it to transmit its UID. 
Sensors can also be awakened by tapping them against a 
hard surface or shaking them sharply. Similarly, if a sen- 
sor is in the pocket of one’s pants, backpack, or purse, it 
will occasionally wake up and start transmitting. Once 
walking, running, or shaking ceases, the sensor goes to 
sleep after approximately ten seconds. 


While the sensor is awake and nearby, we observed
that it transmits one packet every second (containing the
UID). When the sensor is more distant or around a corner,
the receiver hears packets intermittently, but still aligned
to one-second intervals. When multiple sensors are awake near
one another, some packets get corrupted (their checksums
do not match). As the number of awake sensors increases,
so does the number of corrupt packets. However, our
tests with seven sensors indicated the receiver still hears
every sensor UID at least once in a ten second window.
During our experiments with the Nike+iPod sensors we 
observed approximately a 10 meter range indoors and a 
10-20 meter range outdoors. Sensors are also detectable 
while moving quickly. Running by a receiver at approx- 
imately 10 MPH, the sensor is reliably received. Driv- 
ing by someone walking with a sensor in their shoe, the 
sensor can be reliably detected at 30 MPH. We have not 
tested faster speeds. 
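
A rough geometric estimate helps put these speeds in context. The Python sketch below computes how long a sensor moving in a straight line stays within range of a receiver, under the simplifying assumption (ours, for illustration) that reception works everywhere inside a disc whose radius equals the observed range. The function name and the offset_m parameter (closest distance between the path and the receiver) are hypothetical.

```python
import math

MPH_TO_MPS = 0.44704  # exact conversion from miles per hour to meters per second


def seconds_in_range(range_m, speed_mph, offset_m=0.0):
    """Time spent inside a disc of radius range_m when passing the
    receiver at a closest distance of offset_m."""
    if offset_m >= range_m:
        return 0.0
    chord = 2 * math.sqrt(range_m ** 2 - offset_m ** 2)
    return chord / (speed_mph * MPH_TO_MPS)
```

Under this model, a sensor passing directly by a receiver at 30 MPH with a 10 meter range stays in range for about 1.5 seconds, and at 10 MPH for about 4.5 seconds. At one packet per second, even the faster pass leaves room for at least one transmission.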


3.4 Instrumenting Attacks 


Section 3.2 shows that it is possible for an adversary 
to extract a Nike+iPod sensor’s UID from sniffed radio
transmissions, and Section 3.3 qualifies the circumstances
under which the receiver might be able to sniff
those transmissions. These results already enable us to
conclude that, despite broad awareness about the trackability
concerns with unique identifiers in other technologies
(e.g., RFIDs, discoverable Bluetooth), new commercial
products are still entering the market without any
strong protection mechanisms for ensuring users’ loca- 
tion privacy. 

We now seek to explore just how easy — in terms of 
cost and technical sophistication — it might be for an 
adversary to exploit the Nike+iPod Sport Kit’s lack of 
location privacy protection and, at the same time, to ex- 
plore the types of applications that an adversary might 
build. For example, one application that we built is a 
GoogleMaps-based system that pools data from multiple 
Nike+iPod sniffers and displays the resulting tracking information
on a map in real time. When assessing the
ease with which an attacker might be able to implement 
a Nike+iPod-based surveillance system, it is worth not- 
ing that the attacker may not need to write source code 
him or herself, but may instead download the necessary 
software from somewhere on the Internet. We built the 
following components and systems: 


• Receiver to USB Adaptor. We created a compact
USB receiver module for connecting the Nike+iPod
receiver to a computer via USB. Our module (see
Figure 3b) does not require any modification to the
Nike+iPod receiver and consists of a female iPod connector
[23] and a serial-to-USB board utilizing the FTDI
FT2232C chipset [14]. We connected the serial pins 
and power pins of the iPod connector to the appro- 
priate pins of the FT2232C board. When this module 
is connected to a computer, the receiver is then pow- 
ered and a USB serial port is made available for our 
software to communicate with the receiver. With the 
receiver attached, this package is approximately 3cm 
x 3cm x 2cm. 

We also created a Windows serial communications
tool for interfacing with the Nike+iPod receiver using
our adapter. Our tool can detect the UIDs of nearby 
Nike+iPod sensors and transmit those UID readings, 
a timestamp, and latitude and longitude information 
to a back-end SQL server for post-processing; the lat- 
itude and longitude are currently set manually. Op- 
tionally, when a sensor is detected, this application can 
take photographs with a USB camera and upload those 
photographs to the SQL server along with the UID in- 
formation. This application can also SMS or email 
sensor information to pre-specified phone numbers or 
email addresses. 


• Gumstixs. We also implemented a cheap Nike+iPod
surveillance device using the Linux-based gumstix 
computers. This module consists of an unmodi- 
fied $29 Nike+iPod receiver, a $109 gumstix connex 
200xm motherboard, a $79 wifistix, a $27.50 gum- 
stix breakout board, and a $2.95 female iPod connec- 
tor. The Nike+iPod receiver is connected directly to 
the gumstix’s serial port, thereby eliminating the need 
for our serial-to-USB adaptor. The assembled package
is 8cm x 2.1cm x 1.3cm and weighs 1.1 ounces;
see Figure 3a. 

Our gumstix-based module runs a 280-line C
program that communicates with the Nike+iPod re- 
ceiver over a serial port and that uses the wifistix 
802.11 wireless module to wirelessly transmit real- 
time surveillance data to a centralized back-end 
server. The real-time reporting capability allows the 
gumstix module to be part of a larger real-time surveillance
system. If an adversary does not need this real-time
capability, then the adversary can reduce the cost
of this module by omitting the wifistix. 


• A Distributed Surveillance System. To illustrate the
power of aggregating sensor information from mul- 
tiple physical locations, we created a GoogleMaps- 
based web application. Our web application uses and 
displays the sensor event data uploaded to a central 
SQL server from multiple data sources. The data 
sources may be our serial communication tool or our 
gumstix application. 

In real-time mode, sensors’ UIDs are overlaid on
a GoogleMaps map at the location the sensor is seen.
When the sensor is no longer present at that location,
the UID disappears. Optionally, digital pictures taken
by a laptop when the sensor is first seen can be overlaid
instead of the UID. In history mode, the web
application allows the user to select a timespan and
show all sensors recorded in that timespan. For example,
one could select the timespan between noon and
6pm on a given day; all sensors seen that afternoon
will be overlaid on the map at the appropriate location.

This application would allow many individuals to 
track people of interest. An attacker might also use 
this tool to establish patterns of presence. If many at- 
tackers with receivers cooperated, this software and 
website would allow the tracking and correlation 
of many people with Nike+iPod sensors. Among 
the related research, demonstration, and commercial 
Bluetooth- and 802.11 wireless-based tracking systems
(e.g., [6, 8, 10, 17, 31, 37, 39]), we are unaware
of any other location-based surveillance system that 
goes as far as plotting subjects’ locations on a map in 
real-time. 
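
The back-end data flow shared by these components (sniffers upload UID, timestamp, latitude, and longitude; the web application queries by timespan) can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions and not our actual server code: the table name, column names, and function names are hypothetical, and SQLite stands in for the central SQL server.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the sightings database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sightings ("
        "uid TEXT NOT NULL, seen_at REAL NOT NULL, "
        "lat REAL NOT NULL, lon REAL NOT NULL)"
    )
    return db


def record(db, uid, seen_at, lat, lon):
    """Store one sensor sighting reported by a sniffer."""
    db.execute("INSERT INTO sightings VALUES (?, ?, ?, ?)",
               (uid, seen_at, lat, lon))
    db.commit()


def history(db, start, end):
    """History mode: all sightings with start <= seen_at <= end."""
    return db.execute(
        "SELECT uid, seen_at, lat, lon FROM sightings "
        "WHERE seen_at BETWEEN ? AND ? ORDER BY seen_at",
        (start, end),
    ).fetchall()
```

Each row returned by history carries the coordinates needed to place a marker on the map; a real-time view would issue the same query over a short, recent timespan.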


We also developed two other surveillance devices — one 
which uses a third-generation iPod and iPod Linux to de- 
tect nearby Nike+iPod sensors, and the other of which 
uses a second-generation Intel Mote (iMote2) to detect 
nearby Nike+iPod sensors and beams the recorded information
to a paired Microsoft SPOT watch via Bluetooth.
For brevity, and since the above applications provide a 
survey of the applications that we developed, we omit 
discussion of our iPod Linux- and iMote2-based appli- 
cations here. 


3.5 Privacy-Preserving Alternatives 


Our results show that, despite public awareness of the 
importance of location privacy and untrackability, major 
new products are still being introduced without strong 
privacy guards. We consider this situation unfortunate 
since in many cases it is technically possible to significantly
improve consumer privacy.


Exploiting (Largely) Static Associations. Consider the 
typical usage scenario for the Nike+iPod Sport Kit. In 
the common case, we expect that once a user purchases 
a Nike+iPod Sport Kit, he or she will rarely use the sen- 
sor from that kit with the receiver from a different kit. 
This means that the sensor and the receiver could have 
been pre-programmed at the factory with a shared secret 
cryptographic key. By having the sensor encrypt each 
broadcast message with this shared key, the Nike+iPod 
designers could have addressed most of our privacy con- 
cerns about the Nike+iPod application protocol; there 
may still be information leakage through the underlying 
radio hardware, which would have to be dealt with separately.
If the manufacturer decides a sensor from one
kit should be usable with the receiver from a separate kit,
then several options still remain. For example, under the 
assumption that one will only rarely want to use a sensor 
from one kit with a receiver from another, the crypto- 
graphic key could be written on the backs of the sensors, 
and a user could manually enter that key into their iPods 
or computers before using that new sensor. Alternately, 
the sensor could have a special button on it that, when 
pressed, causes the sensor to broadcast a cryptographic
key for some short duration of time.


Un-Sniffable Unique Identifiers. Assume now that 
both the sensor and the receiver in a Nike+iPod Sport Kit 
are preprogrammed with the same shared 128-bit cryptographic
key K. One design approach would be for
the sensor to pre-generate a new pseudorandom 128-bit
value X during the one-second idle time between broadcasts.
Although the sensor could generate X using physical
processes, we suggest generating X by using AES
in CTR mode with a second, non-shared 128-bit AES
key K’. Also during this one-second idle time between
broadcasts, the sensor could pre-generate a keystream
S using AES in CTR mode, this time with the initial
counter X and the shared key K. Finally, when the
sensor wishes to send a message M to the corresponding
receiver, the sensor would actually send the pair
(X, M ⊕ S), where “⊕” denotes the exclusive-or operation.
Upon receiving a message (X, Y), the receiver
would re-generate S from X and the shared key K, recover
M as Y ⊕ S, and then accept M as coming from
the paired sensor if M contains the desired UID. This
construction shares commonality with the randomized 
hash lock protocol for anonymous authorization [42] in 
which an RFID tag reader must try all tag keys in order 
to determine the identity of an RFID tag; in our case a 
receiver must attempt to decrypt all received messages, 
even when the messages are intended for other receivers. 
While it is rather straightforward to argue that this construction
provides privacy at the application level against
passive adversaries (by leveraging Bellare et al.’s [4]
provable security results for CTR mode encryption), we 
do acknowledge that this construction may not fully pro- 
vide all desired target security properties against active 
adversaries. Furthermore, we acknowledge that there are 
ways of optimizing the approach outlined above, and that 
the above approach may affect the battery life, manufac- 
turing costs, and usability of the Nike+iPod Sport Kit. 
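
The message flow of this construction can be sketched in Python. Two substitutions are made so that the sketch runs with the standard library alone, and both are assumptions of this illustration rather than part of the proposal: the keystream is produced by HMAC-SHA256 in counter mode as a stand-in for AES in CTR mode, and X is drawn from the operating system's random source instead of being generated with a second AES key K’. The function names and the convention that M begins with the four-byte UID are likewise hypothetical.

```python
import hashlib
import hmac
import os


def keystream(key, x, length):
    """Stand-in for AES-CTR: HMAC-SHA256 over X and a block counter."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, x + counter.to_bytes(4, "big"),
                        hashlib.sha256).digest()
        counter += 1
    return out[:length]


def xor(a, b):
    return bytes(p ^ q for p, q in zip(a, b))


def sensor_send(shared_key, message):
    """Return the pair (X, M xor S) with a fresh 128-bit X."""
    x = os.urandom(16)
    return x, xor(message, keystream(shared_key, x, len(message)))


def receiver_accept(shared_key, expected_uid, packet):
    """Recover M from (X, Y); accept it only if it carries the paired UID."""
    x, y = packet
    m = xor(y, keystream(shared_key, x, len(y)))
    return m if m.startswith(expected_uid) else None
```

Because X changes with every broadcast, two transmissions of the same message look unrelated to a passive eavesdropper who lacks K, while the paired receiver recovers M and checks the UID. A receiver holding a different key recovers bytes that, except with negligible probability, do not contain its expected UID, and discards the packet.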


Use an On-Off Switch. One natural question to ask is 
whether a sufficient privacy-protection mechanism might 
simply be to place on-off switches directly on all mobile 
personal devices, like the Nike+iPod Sport Kit sensors. 
Unfortunately, this approach by itself will not protect 
consumers’ privacy while the devices are in operation. 
Additionally, we believe that it is unrealistic to assume 
that most users will actually turn their devices off when 
not in use, especially as the number of such personal de- 
vices increases over time. 


3.6 Challenges 


While the above discussion clearly shows that it is pos- 
sible to significantly improve upon the privacy proper- 
ties of the current Nike+iPod Sport Kits, from a broader 
perspective the solutions advocated above are somewhat 
unsatisfying. For example, how does one generalize the 
above recommendations (or derive new recommenda- 
tions) for wireless devices that do not have largely static 
pairings, such as commercial 802.11 wireless hot spots 
or the dynamic peer-to-peer pairings of the Zune, where 
one may wish to allow for ad hoc network formations but 
still restrict access to only authorized devices? And how 
does one reduce the extra costs (e.g., battery lifetime, 
packet size, the need to decrypt packets intended for other
parties) for environments that cannot afford the extra resource
requirements? If we wish to provide a strong level
of location privacy for future UbiComp devices, we need 
to develop mechanisms for handling such broad classes 
of situations. 


The challenge, therefore, is to provide anonymous 
communications for wireless devices in more diverse and 
potentially ad hoc environments. This challenge is not 
unique to us — indeed, others have also considered this 
problem in other restricted contexts [16, 21, 33, 42, 28, 
33, 44] — but bears repeating given the potential com- 
plexities; e.g., while we have focused this discussion on 
unique identifiers, which by themselves are not trivial to 
address, application characteristics and other side chan- 
nel information, which can survive encryption [30, 45], 
might facilitate the tracking and identification of individ- 
uals. 







4 Zunes: Challenges with Managing Ad 
Hoc Mobile Social Interactions 


The Microsoft Zune portable media player is one of the 
first portable media devices to include wireless capabil- 
ity for the purpose of sharing media. Zune owners can 
enter a coffee shop, turn on their Zune, and discover 
nearby Zunes. Once a nearby Zune is discovered, users 
can send music or photos to the nearby Zune. Discov- 
ery and sharing are meant to facilitate social interaction; 
hence the Zune slogan: “Welcome to the Social.” Like 
the Nike+iPod Sport Kit and SlingBox, the Zune repre- 
sents a gadget pioneering a new application space and 
represents a central example of our third class of Ubi- 
Comp devices geared toward catalyzing new social in- 
teractions. However, we demonstrate that there are chal- 
lenges with protecting users’ privacy and safety while 
simultaneously providing ad hoc communications with 
strangers. 


4.1 Zune Description 


We focus this description on how the Zune media player 
allows users to control their social interactions. Consider 
a scenario consisting of two users, Alice and Bob, and as- 
sume that Alice and Bob respectively name their Zunes 
AliceZune and BobZune; Alice and Bob choose these 
names when they configure their Zune. If Bob wishes to 
utilize the Zune social system, to see who’s around, he 
would first use the Zune interface to navigate to the “com- 
munity — nearby devices” menu. He will then see the 
names of all discoverable nearby Zunes and, depending 
on the options chosen by the owners of the other Zunes, 
the names of the songs that his neighbors are listening 
to or their state (online/busy). If Bob wishes to share a 
song or picture with his neighbors, he must first select the 
song or picture and then select the “send” option. The 
Zune will then show Bob the names of nearby Zunes, 
and Bob can then send the song or picture to a neighbor 
of his choosing, in this case AliceZune. The interface 
on Alice’s Zune asks whether Alice wishes to accept a 
song from BobZune; no additional information about the 
song or picture is included in the prompt. Alice has two 
choices: to accept the content or to not accept the con- 
tent. If Alice accepts the song and later decides that she 
would like to prevent Bob from ever sending her a song 
in the future, she can navigate to her Zune’s “community 
— nearby devices” menu, select BobZune, and then select 
the “block” option. 


4.2 Circumventing the Zune Blocking 
Mechanism 


Microsoft appears to envision a world where Zune own- 
ers wish to receive interesting content from people they 
have never met before. Of course, these users also wish 
to avoid being bothered by people or companies that 
send inappropriate or annoying content, hence the Zune’s 
blocking feature. Such a situation is not purely hypothet- 
ical; indeed, there have recently been media reports about 
advertisers beaming unsolicited content to users with dis- 
coverable Bluetooth devices [7]. 

Unfortunately, we find that a malicious adversary 
could circumvent the Zune blocking feature, and we 
have verified this in practice. The critical issue revolves 
around how blocking is actually implemented on the 
Zunes. When Bob sends a song or image to Alice, Al- 
ice is only given the option of accepting or denying the 
song or image; she is not given the option of blocking 
the sender. Then, after playing the song or viewing the 
image, if Alice wishes to block Bob’s Zune in the future, 
she must navigate to the “community — nearby devices” 
menu and actively choose to block BobZune. 

The crux of the problem is that Alice will not be able 
to block Bob’s Zune if BobZune is no longer nearby or 
discoverable. 


Disappearing attack Zune. A simple method to circum- 
vent the Zune block feature is, after beaming an inap- 
propriate image, to turn the wireless on the originating 
Zune off. Since Alice may remember the name of Bob’s 
Zune, and thereby simply deny messages from BobZune 
in the future, Bob can change the name of his Zune 
before trying to beam Alice additional content. Also, 
before beaming Alice the inappropriate content in the 
first place, Bob could scan his nearby community, find 
a nearby Zune named CharlieZune, and then name his 
Zune CharlieZune. If Bob sends inappropriate content to 
Alice and then turns off his wireless, he might trick Alice 
into blocking the real CharlieZune. 


Fake MAC addresses. Upon further investigation, we 
find that the Zune neighbor discovery process and block- 
ing mechanism are based on 802.11 probe-responses and 
MAC addresses. Bob could therefore use a Linux laptop 
to fool Alice into thinking that she has blocked BobZune 
when in fact she has not; unlike the observation in the 
previous paragraph, our attack here works even when 
there are no other nearby Zunes. 

Consider again the scenario above, in which Bob sends in- 
appropriate content to Alice, disables his Zune’s wire- 
less, and changes his Zune’s name. Suppose Alice does 
not like the content she received from Bob and navigates 
to the nearby list on her Zune. Bob can use his laptop 
to send out Zune 802.11 probe-responses with the same 
name that his Zune was using but with a different MAC 
address. Alice will then see the previous name of Bob’s 
Zune in her nearby list and select the block command. 
It will now appear to Alice that she has blocked Bob’s 
Zune. In reality, however, what has occurred is that Alice 
has blocked a different MAC address. The next time Bob 
enables his Zune’s wireless and attempts to send inappro- 
priate content to Alice, it will appear to Alice that Bob 
is sending content from a third BobZune that Alice has 
never seen before. We have implemented a C application 
for Linux that uses the MadWiFi drivers and an Atheros 
Chipset-based wireless card to listen to 802.11 probe- 
requests from Zunes and send a Zune probe-response 
with whatever name and MAC address the user desires. 
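
The fake-MAC attack described above can be illustrated with a small model. This is only a sketch under stated assumptions: the `Beacon` and `ZuneModel` classes, the MAC addresses, and the device names are hypothetical and are not taken from the Zune firmware or from our C tool. The only behavior assumed is the one observed above, namely that the user selects a name from the nearby list while the device records a MAC address.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Beacon:
    """What a nearby device advertises: a user-chosen name and a MAC address."""
    name: str
    mac: str


@dataclass
class ZuneModel:
    """Toy receiver that blocks by MAC address (hypothetical model)."""
    blocked_macs: set = field(default_factory=set)

    def block(self, nearby, name):
        # The user picks a *name* from the nearby list; the device records
        # the MAC address of whichever beacon currently carries that name.
        for beacon in nearby:
            if beacon.name == name:
                self.blocked_macs.add(beacon.mac)
                return True
        return False  # nothing with that name is currently discoverable

    def accepts_from(self, beacon):
        return beacon.mac not in self.blocked_macs


alice = ZuneModel()
bob_zune = Beacon("BobZune", "00:11:22:33:44:55")  # Bob's real device (made-up MAC)
decoy = Beacon("BobZune", "66:77:88:99:aa:bb")     # laptop probe-response (made-up MAC)

# Bob's Zune has its wireless turned off, so only the decoy is visible.
blocked = alice.block(nearby=[decoy], name="BobZune")

# Bob renames his Zune and turns its wireless back on.
renamed = Beacon("NewName", bob_zune.mac)
still_reachable = alice.accepts_from(renamed)
```

In this model Alice's block succeeds from her point of view, yet the entry she stored is the decoy's address, so content from Bob's real device is still accepted.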


Post-blocking privacy. Lastly, even when the blocking 
mechanism is used successfully, it only stops Alice from 
receiving new content pushes from BobZune; the mali- 
cious user, Bob, can still detect Alice’s presence unless 
she turns off her Zune’s wireless capability altogether; 
this has the negative side effect of preventing Alice from 
sharing any media at all if she doesn’t want to be de- 
tectable by Bob. 


4.3 Improving User Control 


Perhaps the most natural method for protecting against 
such unsolicited content is to adopt what is now common 
practice in other social applications, such as instant mes- 
saging: create a “buddy list” and only accept connections 
from known buddies. One might populate the buddy list 
using some interactions that require two Zunes to be in 
close proximity [2]. Such a buddy list is, however, in di- 
rect conflict with the Zune’s intended goal of initiating 
ad hoc interactions with total strangers. 

Therefore, the goal is to improve the resistance of 
the Zune blocking mechanisms to attacks like those we 
present above. One simple solution to Bob’s blocking 
circumvention is to record which Zune sent the specific 
media and allow the user to block the sender of media 
even if they are not currently nearby and active. We note, 
however, that there are some subtleties that one must 
consider. For example, since the Zune blocking mech- 
anism described above seems to be based on the Zune’s 
MAC address (recall that our C program in Section 4.2 
created 802.11 probe-responses with forged MAC ad- 
dresses to trick the Zune blocking feature) Bob might 
still be able to circumvent this improved blocking mech- 
anism by mounting a MAC-rewriting man-in-the-middle 
attack between his Zune and Alice’s. Since the Zunes 
communicate using encryption, MAC rewriting of this 
form will not, however, be successful if the Zunes’ MAC 
addresses are used as input to the encryption key deriva- 
tion process. We have not yet been able to deter- 
mine whether or not this is actually the case, but argue 


below that the use of MAC addresses for this purpose is 
fundamentally problematic if one also wishes to protect 
information about a user’s presence from outsiders (recall 
Section 3). 
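
The simple solution mentioned in this section, recording which device sent each item so that the sender can be blocked later, might look like the following sketch. The `Inbox` class and the `sender_id` strings are hypothetical. The sketch also assumes that the identifier cannot be altered by the sender, which, as noted above, may not hold for MAC addresses.

```python
class Inbox:
    """Toy receiver that remembers who sent each item (illustrative only)."""

    def __init__(self):
        self.received = {}    # media title -> sender identifier
        self.blocked = set()  # identifiers the user has blocked

    def receive(self, title, sender_id):
        if sender_id in self.blocked:
            return False
        self.received[title] = sender_id
        return True

    def block_sender_of(self, title):
        # Works even if the sender is no longer nearby or discoverable.
        sender_id = self.received.get(title)
        if sender_id is not None:
            self.blocked.add(sender_id)
        return sender_id


inbox = Inbox()
inbox.receive("unwanted.jpg", "device-42")   # Bob sends, then disappears
inbox.block_sender_of("unwanted.jpg")        # Alice blocks after the fact
second_try = inbox.receive("another.jpg", "device-42")
```

Because the block is tied to the stored item rather than to the nearby list, the disappearing attack no longer prevents Alice from blocking the sender.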


4.4 Challenges 


While there has been significant research on providing 
control over private information in social networks in 
ubiquitous social applications, much of the work fo- 
cuses on situations with hierarchical or other complex 
relationships, such as boss/spouse/friend or buddies/non- 
buddies [22]. While there is still much work to be done in 
this space, the Zunes suggest another scenario in which a 
key target application is to share content with strangers. 
Blocking individuals in such a scenario can be very chal- 
lenging when users have complete control over the infor- 
mation that their devices present to others. 

When all the devices are homogeneous and incorpo- 
rate a secure hardware module, one possibility is to let 
that secure hardware control what information is shared 
with the user and other devices, and to ensure that some 
information (such as a unique identifier) is not mutable 
by the user. The secure hardware might then use this non- 
mutable information to control blocking. Coupled with 
the discussion in Section 3, one must ensure that these 
unique identifiers do not reveal private information about 
a user’s presence. For example, this unique identifier 
should not be an 802.11 MAC address, which the Zunes 
currently appear to use for blocking purposes. While 
there might be approaches for addressing this problem in 
the case of homogeneous devices with secure hardware 
from the same manufacturer (e.g., restricted behavior on 
the secure hardware and symmetric key agreement using 
the exchange of anonymous public keys [3] signed using 
a group signature scheme [9]), solving this problem in 
the case of a heterogeneous environment appears to be a 
challenge. 


5 Conclusions 


We technically explore privacy and security properties 
of several commercial UbiComp products. We find that 
despite research and public awareness, these products do 
not provide strong levels of privacy protection and do not 
put the user in control of their private information. 

Our analysis of the encrypted SlingBox stream sug- 
gests that transmission characteristics from variable data 
rate encoding can cause information leakage even when 
such a stream is encrypted. This puts users’ privacy at 
risk because one might assume encryption is enough to 
thwart an eavesdropper from learning what media one is 
watching. Our first attempt at recognizing movies via 
their variable throughput in a 26 movie database yielded 
an overall accuracy of approximately 62% for the best 
match and 73% for ranking in the top 5 matches when 
a 10 minute query trace was used, and 77% and 89%, re- 
spectively, when a 40 minute query trace was used. Com- 
pared with the 4% and 21%, respectively, that one can 
expect with random guessing, these results show substantial informa- 
tion leakage. For certain movies our accuracy rates are 
significantly higher; for example, for 15 out of our 26 
movies, a 40-minute query trace will match with the cor- 
rect movie over 98% of the time. When a variable data 
rate encoding is used, a content provider could poten- 
tially increase this accuracy by using a throughput-based 
watermarking scheme. 

Persistent identifiers in the Nike+iPod Sport Kit and 
Zune potentially reveal presence and, in the Nike+iPod 
case, we demonstrate how a tracking system can be built 
using the Nike+iPod Sport Kit sensors and receivers. We 
argue that these persistent identifiers should not be used 
in future devices and should instead be replaced with 
other privacy preserving mechanisms. 

Finally, our evaluation of the Zune blocking scheme 
shows that an interface design choice coupled with a 
technology choice can take control away from the con- 
sumer and put it in the hands of malicious users. To- 
gether, the results from this paper demonstrate that with new 
classes of devices come new privacy and security chal- 
lenges; privacy must be designed in at all levels of the 
protocol stack. 
Abstract 


Newly published data, when combined with existing 
public knowledge, allows for complex and sometimes 
unintended inferences. We propose semi-automated 
tools for detecting these inferences prior to releasing 
data. Our tools give data owners a fuller understanding 
of the implications of releasing data and help them ad- 
just the amount of data they release to avoid unwanted 
inferences. 

Our tools first extract salient keywords from the pri- 
vate data intended for release. Then, they issue search 
queries for documents that match subsets of these key- 
words, within a reference corpus (such as the public 
Web) that encapsulates as much of relevant public knowl- 
edge as possible. Finally, our tools parse the documents 
returned by the search queries for keywords not present 
in the original private data. These additional keywords 
allow us to automatically estimate the likelihood of cer- 
tain inferences. Potentially dangerous inferences are 
flagged for manual review. 

We call this new technology Web-based inference 
control. The paper reports on two experiments which 
demonstrate early successes of this technology. The first 
experiment shows the use of our tools to automatically 
estimate the risk that an anonymous document allows 
for re-identification of its author. The second experiment 
shows the use of our tools to detect the risk that a doc- 
ument is linked to a sensitive topic. These experiments, 
while simple, capture the full complexity of inference de- 
tection and illustrate the power of our approach. 


1 Introduction 


Information has never been easier to find. Search en- 
gines allow easy access to the vast amounts of infor- 
mation available on the Web. Online data repositories, 
newspapers, public records, personal webpages, blogs, 
etc., make it easy and convenient to look up facts, keep 
up with events and catch up with people. 

On the flip side, information has never been harder to 
hide. With the help of a search engine or web informa- 
tion integration tool [45], one can easily infer facts, re- 
construct events and piece together identities from frag- 
ments of information collected from disparate sources. 
Protecting information requires hiding not only the in- 
formation itself, but also the myriad of clues that might 
indirectly lead to it. Doing so is notoriously difficult, as 
seemingly innocuous information may give away one’s 
secret. 

To illustrate the problem, consider a redacted biogra- 
phy [8] (shown in the left-hand side of figure 6) that was 
released by the FBI. Prior to publication, the biography 
was redacted to protect the identity of the person whom 
it describes. All directly identifying information, such as 
first and last names, was expunged from the biography. 
The redacted biography contains only keywords that ap- 
ply to many individuals, such as “half-brother”, “Saudi”, 
“magnate” and “Yemen”. None of these keywords is par- 
ticularly identifying on its own, but in aggregate they al- 
low for near-certain identification of Osama Bin Laden. 
Indeed, a Google search for the query “Saudi magnate 
half-brother” returns, in the top 10 results, pages that are 
all related to the Bin Laden family. This inference, as 
well as potentially many others, should be anticipated 
and countered in a thorough redaction process. 

The need to protect secret information from unwanted 
inferences extends far beyond the FBI. In addition to in- 
telligence agencies and the military, numerous govern- 
ment agencies, businesses and individuals face the prob- 
lem of insulating their secrets from the information they 
disclose publicly. In the litigation industry, for example, 
information protected by client-attorney privilege must 
be redacted from documents prior to disclosure. In the 
healthcare industry, it is common practice, and mandated 
by some US state laws, to redact sensitive information 
(such as HIV status, drug or alcohol abuse and mental 
health conditions) from medical records prior to releas- 
ing them. Among individuals, anonymous bloggers are 
a good example of people who seek to ensure that their 
posts do not disclose their secret (their identity). This 
is made challenging by the fact that in some cases very 
little personal information may suffice to infer the blog- 
ger’s identity. For example, if the second author of this 
paper were to reveal his first name (Philippe) and men- 
tion the first name of his wife (Sanae), then his last name 
(or at least, a strong candidate for his last name) can be 
inferred from the first hit returned by the Google query, 
“Philippe Sanae wedding”. 


In all these instances, the problem is not access con- 
trol, but inference control. Assuming the existence of 
mechanisms to control access to a subset of informa- 
tion, the problem is to determine what information can 
be released publicly without compromising certain se- 
crets, and what subset of the information cannot be re- 
leased. What makes this problem difficult is the quantity 
and complexity of inferences that arise when published 
data is combined with, and interpreted against, the back- 
drop of public knowledge and outside data. 


This paper breaks new ground in considering the prob- 
lem of inference detection not in a restricted setting (such 
as, e.g., database tables), but in all its generality. We 
propose the first all-purpose approach to detecting un- 
wanted inferences. Our approach is based on the ob- 
servation that the combination of search engines and the 
Web, which is so well suited to detect inferences, works 
equally well defensively as offensively. The Web is an 
excellent proxy for public knowledge, since it encapsu- 
lates a large fraction of that knowledge (though certainly 
not all). Furthermore, the dynamic nature of the Web 
reflects the dynamic nature of human knowledge and 
means that the inferences detected today may be different 
from those drawn yesterday. The likelihood of certain in- 
ferences can thus be estimated automatically, at any point 
in time, by issuing search queries to the Web. Returning 
to the example of the biography redacted by the FBI, a 
simple search query could have flagged the risk of re- 
identification coming from the keywords “Saudi”, “mag- 
nate” and “half-brother”. 

The Web is an ideal resource for identifying infer- 
ences because keyword search allows for efficient de- 
tection of the information that is associated with an in- 
dividual. Such associations can be just as important in 
identifying someone as their personal attributes. As an 
example, consider the fact that the top 2 hits returned by 
the Google query, “pop singer vogueing”¹ have nothing 
to do with the singer Madonna, whereas the top 3 hits re- 
turned by the Google query, “gay pop singer vogueing”² 
all pertain to Madonna. The attribute “gay” helps to fo- 
cus the results not because it is an attribute of Madonna 


(at least not as it is used in the top 3 hits) but rather because 
it is an attribute associated with a large subset of her fan 
base. Similarly, the hits on the entire first page returned by 
the query “naltrexone acamprosate”³ all pertain to alco- 
holism, not because these terms are symptoms of alcoholism or in 
some other way part of its definition, but 
rather because they are drugs commonly used in its treatment. 

We propose generic tools for detecting unwanted in- 
ferences automatically using the Web. These tools first 
extract salient keywords from the private data intended 
for release. Then, they issue search queries for docu- 
ments that match subsets of these keywords, within a 
reference corpus (such as the public Web) that encapsu- 
lates as much of relevant public knowledge as possible. 
Finally, our tools parse the documents returned by the 
search queries for keywords not present in the original 
private data. These additional keywords allow us to au- 
tomatically estimate the likelihood of certain inferences. 
Potentially dangerous inferences are flagged for manual 
review. We call this new technology Web-based infer- 
ence control. 

We demonstrate the success of our inference detection 
tools with two experiments. The first experiment shows 
the use of our tools to automatically estimate the risk that 
an anonymous document allows for re-identification of 
its author. The second experiment shows the use of our 
tools to detect the risk that a document is linked to a sen- 
sitive topic. These experiments, while simple, capture 
the full complexity of inference detection and illustrate 
the power of our approach. 


OVERVIEW. We discuss related work in section 2. 
We define our models and tools, as well as our basic 
algorithm for Web-assisted inference detection in sec- 
tion 3. We list a number of potential applications of 
Web-assisted inference control in section 4. Section 5 
describes two experiments that demonstrate the success 
of our inference control tools. Section 6 provides an ex- 
ample using Web-based inference detection to improve 
the redaction process. We conclude in section 7. 


2 Related Work 


Our work can be viewed both as a new technique for in- 
ference detection and as a new way of leveraging Web 
search to understand content. There is substantial exist- 
ing work in both areas, but ours is the first Web-based 
approach to inference detection. We discuss the most 
closely related work in these areas below. 


INFERENCE DETECTION. Most of the previous work on 
inference detection has focused on database content (see, 
for example, [33, 21, 43, 19]). Work in this area takes 
as input the database schema, the data themselves and, 
sometimes, relations amongst the attributes of the data- 
base that are meant to model the outside knowledge a 
human may wield in order to infer sensitive information. 
To the best of our understanding, no systematic method 
has been demonstrated for integrating this outside knowl- 
edge into an inference detection system. Our work seeks 
to remedy this by demonstrating the use of the Web for 
this purpose. When coupled with simple keyword extrac- 
tion, this general technique allows us to detect inference 
in a variety of unstructured documents. 

A particular type of inference allows the identifica- 
tion of an individual. Sweeney looks for such inferences 
using the Web in [35] where inferences are enabled by 
numerical values and other attributes characterizable by 
regular expressions such as SSNs, account numbers and 
addresses. Sweeney does not consider inferences based 
on English language words. We use the indexing power 
of search engines to detect when words, taken together, 
are closely associated with an individual. 

The closely related problem of author identification 
has also been extensively studied by the machine learn- 
ing community (see, for example, [25, 11, 24, 34, 20]). 
The techniques developed generally rely on a training 
corpus of documents and use specific attributes like self- 
citations [20] or writing style [25] to identify authors. 
Our work can be viewed as exploiting a previously un- 
studied method of author identification, using informa- 
tion authors reveal about themselves to identify them. 

Atallah, et al. [2], describe how natural language 
processing can potentially be used to sanitize sensi- 
tive information when the sanitization rules are already 
known. Our work is focused on using the Web to iden- 
tify the sanitization rules. 


WEB-ASSISTED QUERY INTERPRETATION. There is a 
large body of work on using the Web to improve query 
results (see, for example, [16, 32, 10]). One of the funda- 
mental ideas that has come out of this area is to use over- 
lap in query results to establish a connection between dis- 
tinct queries. In contrast, we analyze the content of the 
query results in order to detect connections between the 
query terms and an individual or topic. 


WEB-BASED SOCIAL NETWORK ANALYSIS. Recently, 
the Web has been used to detect social networks (e.g., 
[1, 23]). A key idea in this work is using the Web to look 
for co-occurrences of names and using this to infer a link 
in a social network. Our techniques can support this type 
of analysis, for example when names in a network, 
entered as a Web query, yield a name that is not already 
in the network. However, our techniques are aimed at 
a broader goal, that is, understanding all inferences that 
can be drawn from a document. 


WEB-ASSISTED CONTENT ANALYSIS AND ANNOTA- 
TION. There is a large body of work on using the Web 


to understand and analyze content. Nakov and Hearst 
[30] have shown the power of using the Web as training 
data for natural language analysis. Web-assistance for 
extracting keywords for the purposes of content indexing 
and annotation is studied in [12, 37, 26]. This work is fo- 
cused on automated, Web-based tools for understanding 
the meaning of the text as written, as opposed to the in- 
ferences that can be drawn based on the text. That said, 
in our work we use very simple content analysis tools, 
and improvements to our approach could involve more 
sophisticated content analysis tools including Web-based 
tools such as those developed in these works. 


WEB-BASED DATA AGGREGATION. Finally, we note 
that the commercial world is beginning to offer Web- 
based data aggregation tools (see, for example [14, 13, 
31]) for the purposes of tracking competitor behavior, 
doing market analysis and intelligence gathering. We are 
not aware of support for pre-production inference control 
in these offerings, which is the focus of this paper. 


3 Model and Generic Algorithm 


Let C denote a private collection of documents that is 
being considered for public release, and let R denote a 
collection of reference documents. For example, the col- 
lection C may consist of the blog entries of a writer, and 
the collection R may consist of all documents publicly 
available on the Web. 

Let K(C) denote all the knowledge that can be com- 
puted from the private collection C. The set K(C) infor- 
mally represents all the statements and facts that can be 
logically derived from the information contained in the 
collection C. The set K(C) could in theory be computed 
with a complete and sound theorem prover given all the 
axioms in C. In practice, such a computation is impos- 
sible and we will instead rely on approximate represen- 
tations of the set K(C). Similarly, let K(R) denote all 
the knowledge that can be computed from the reference 
collection R. 

Informally stated, the problem of inference control 
comes from the fact that the knowledge that can be ex- 
tracted from the union of the private and reference col- 
lections $K(C \cup R)$ is typically greater than the union 
$K(C) \cup K(R)$ of what can be extracted separately from 
C and R. The inference control problem is to understand 
and control the difference: 


$$\mathrm{Diff}(C, R) = K(C \cup R) - (K(C) \cup K(R)).$$


Returning to the Osama Bin Laden example discussed 
in the introduction, consider the case where the col- 
lection C consists of the single declassified FBI docu- 
ment [8], and where R consists of all information pub- 
licly available on the Web. Let S denote the statement: 
“The declassified FBI document is a biography of Osama 
Bin Laden”. Since the identity of the person to whom the 
document pertains has been redacted, it is impossible to 
learn the statement S from C alone, and so $S \notin K(C)$. 
The statement S is clearly not in K(R) either since it is 
impossible to compute from R alone a statement about 
a document that is in C but not in R. It follows that S 
does not belong to $K(C) \cup K(R)$. But, as shown ear- 
lier, the statement S belongs to $K(C \cup R)$. Indeed, we 
learn from C that the document pertains to an individ- 
ual characterized by the keywords “Saudi”, “magnate”, 
“half-brothers”, “Yemen”, etc. We learn from R that 
these keywords are closely associated with “Osama Bin 
Laden”. If we combine these two sources of information, 
we learn that the statement S is true with high probabil- 
ity. 

It is critical to understand Diff(C, R) prior to pub- 
lishing the collection C of private documents, to en- 
sure that the publication of C does not allow for un- 
wanted inferences. The owner of C may choose to with- 
hold from publication parts or all of the documents in 
the collection based on an assessment of the difference 
Diff(C, R). Sometimes, the set of sensitive knowledge 
K* that should not be leaked is explicitly specified. In 
this case, the inference control problem consists more 
precisely of ensuring that the intersection 
$\mathrm{Diff}(C, R) \cap K^*$ is empty. 
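
As a toy illustration of the model above, the following sketch treats knowledge as the closure of a set of facts under simple "premises imply conclusion" rules. This representation, the `closure` helper, and the single rule are assumptions made for the example only; as stated above, K(C) cannot be computed exactly in practice.

```python
def closure(facts, rules):
    """Forward-chain: apply (premises -> conclusion) rules until nothing new appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


# Private collection C: keywords left in the redacted document, no rules.
C_facts = {"Saudi", "magnate", "half-brother"}
C_rules = []

# Reference collection R: an association between keywords and a name,
# but no facts about the private document itself.
R_facts = set()
R_rules = [(frozenset({"Saudi", "magnate", "half-brother"}), "Bin Laden")]

K_C = closure(C_facts, C_rules)
K_R = closure(R_facts, R_rules)
K_CR = closure(C_facts | R_facts, C_rules + R_rules)

diff = K_CR - (K_C | K_R)      # Diff(C, R)
K_star = {"Bin Laden"}         # sensitive knowledge
leak = diff & K_star           # must be empty for a safe release
```

Here neither collection yields the sensitive statement on its own, but their union does, so `leak` is non-empty and the release would be flagged.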


3.1 Basic Approach 


In this work, we consider the case in which C can be any 
arbitrary collection of documents. In particular, contrary 
to prior work on inference control in databases, we do 
not restrict ourselves to private documents formatted ac- 
cording to a well-defined structure. We assume that the 
collection R of public documents consists of all publicly 
available documents, and that the public Web serves as a 
good proxy for this collection. Our generic approach to 
inference detection is based on the following two steps: 


1. UNDERSTANDING THE CONTENT OF THE DOCU- 
MENTS IN THE PRIVATE COLLECTION C. We employ 
automated content analysis in order to efficiently extract 
keywords that capture the content of the document in the 
collection C. A wide array of NLP tools are possible for 
this process, ranging from simple text extraction to deep 
linguistic analysis. For the proof-of-concept demonstra- 
tions described in section 5, we employ keyword selec- 
tion via a “term frequency - inverse document frequency” 
(TF.IDF) calculation, but we note that a deeper linguistic 
analysis may produce better results. 


2. EFFICIENTLY DETERMINING THE INFERENCES 
THAT CAN BE DRAWN FROM THE COMBINATION OF 
C AND R. We issue search queries for documents that 


match subsets of the keywords extracted in step 1, within 
a reference corpus (such as the public Web) that encap- 
sulates as much of relevant public knowledge as possi- 
ble. Our tools then parse the documents returned by the 
search queries for keywords not present in the original 
private data. These additional keywords allow us to au- 
tomatically estimate the likelihood of certain inferences. 
Potentially dangerous inferences are flagged for manual 
review. 


3.2 Inference Detection Algorithm 


In this section, we give a generic description of our infer- 
ence detection algorithm. This description emphasizes 
conceptual understanding. Specific instantiations of the 
inference detection algorithms, tailored to two particular 
applications, are given in section 5. These instantiations 
do not realize the full complexity of this general algo- 
rithm partly for efficiency reasons and partly because of 
the attributes of the application. We start with a descrip- 
tion of the inputs, outputs and parameters of our generic 
algorithm. 


INPUTS: A private collection of documents 
$C = \{C_1, \ldots, C_n\}$, a collection of reference documents R, 
and a list of sensitive keywords $K^*$ that represent sen- 
sitive knowledge. 


OUTPUT: A list L of inferences that can be drawn from 
the union of C and R. Each inference is of the form: 


$$(W_1, \ldots, W_k) \Rightarrow K_0^*,$$

where $W_1, \ldots, W_k$ are keywords extracted from docu- 
ments in C, and $K_0^* \subseteq K^*$ is a subset of sensitive key- 
words. The inference $(W_1, \ldots, W_k) \Rightarrow K_0^*$ indicates 
that the keywords $(W_1, \ldots, W_k)$, found in the collection 
C, together with the knowledge present in R, allow for 
inference of the sensitive keywords $K_0^*$. The algorithm 
returns an empty list if it fails to detect any sensitive in- 
ference. 


PARAMETERS: The algorithm is parameterized by a 
value $\alpha$ that controls the depth of the NLP analysis of 
the documents in C, by two values $\beta$ and $\gamma$ that control 
the search depth for documents in R that are related to 
C, and finally by a value $\delta$ that controls the depth of the 
NLP analysis of the documents retrieved by the search 
algorithm. The values $\alpha$, $\beta$, $\gamma$, and $\delta$ are all positive in- 
tegers. They can be tuned to achieve different trade-offs 
between the running time of the algorithm and the com- 
pleteness and quality of inference detection. 


UNDERSTANDING THE DOCUMENTS IN $\mathcal{C}$. Our basic 
algorithm uses TF.IDF (term frequency - inverse document 
frequency, see [28] and section 5.1) to extract from 
each document $C_i$ in the collection $\mathcal{C}$ the top $\alpha$ keywords 
that are most representative of $C_i$. Let $S_i$ denote the set 
of the top $\alpha$ keywords extracted from document $C_i$, and 
let $S = \bigcup_{i=1}^{n} S_i$. 


INFERENCE DETECTION. The list $\mathcal{L}$ of inferences is 
initially empty. We consider in turn every subset $S' \subseteq S$ 
of size $|S'| \leq \beta$. For every such subset 
$S' = (W_1, \ldots, W_k)$, with $k \leq \beta$, we do the following: 


1. We use a search engine to retrieve from the collection 
$\mathcal{R}$ of reference documents the top $\gamma$ documents 
that contain all the keywords $W_1, \ldots, W_k$. 


2. With TF.IDF, we extract the top $\delta$ keywords from 
this collection of $\gamma$ documents. Note that these keywords 
are extracted from the aggregate collection of 
$\gamma$ documents (as if all these documents were 
concatenated into a single large document), not from 
each individual document. 


3. Let $K_0^*$ denote the intersection of the $\delta$ keywords 
from step 2 with the set $K^*$ of sensitive keywords. 
If $K_0^*$ is non-empty, we add to $\mathcal{L}$ the inference 
$S' \Rightarrow K_0^*$. 


The algorithm outputs the list $\mathcal{L}$ and terminates. 
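
The generic algorithm above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in our experiments: the names `top_keywords`, `search` and `detect_inferences` are hypothetical, the "search engine" is a toy in-memory filter over a list of reference strings, keywords are ranked by the simple ratio TF/DF, and words absent from the `doc_freq` table are assumed to have DF = 1.

```python
from collections import Counter
from itertools import combinations


def top_keywords(text, doc_freq, n):
    """Rank the words of `text` by TF/DF and return the top n.

    Assumption: words missing from `doc_freq` have DF = 1; ties are
    broken alphabetically so that the output is deterministic.
    """
    tf = Counter(text.lower().split())
    ranked = sorted(tf, key=lambda w: (-tf[w] / doc_freq.get(w, 1), w))
    return ranked[:n]


def search(reference, keywords, gamma):
    """Toy stand-in for a search engine: the first `gamma` reference
    documents that contain every keyword."""
    hits = [doc for doc in reference
            if all(k in doc.lower().split() for k in keywords)]
    return hits[:gamma]


def detect_inferences(private_docs, reference, sensitive, doc_freq,
                      alpha, beta, gamma, delta):
    # Understanding the documents in C: S is the union of the
    # top-alpha keywords of every private document.
    S = set()
    for doc in private_docs:
        S.update(top_keywords(doc, doc_freq, alpha))

    inferences = []
    # Inference detection: every non-empty subset S' with |S'| <= beta.
    for k in range(1, beta + 1):
        for subset in combinations(sorted(S), k):
            hits = search(reference, subset, gamma)        # step 1
            if not hits:
                continue
            aggregate = " ".join(hits)                     # step 2
            found = set(top_keywords(aggregate, doc_freq, delta)) & sensitive
            if found:                                      # step 3
                inferences.append((subset, found))
    return inferences


private_docs = ["naltrexone prescribed daily"]
reference = ["naltrexone treats alcoholism", "daily exercise helps sleep"]
result = detect_inferences(private_docs, reference, {"alcoholism"}, {},
                           alpha=2, beta=2, gamma=5, delta=3)
print(result)
```

Note that the number of subsets examined grows quickly with $\beta$, which is one reason the instantiations of section 5 restrict the subsets they consider.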


3.3 Variants of the Algorithm 


The algorithm of section 3.2 can be tailored to a variety 
of applications. Two such applications are discussed in 
exhaustive detail in section 5. Here, we discuss briefly 
other possible variants of the basic algorithm. 


DETECTING ALL INFERENCES. In some applications, 
the set of sensitive knowledge K* may not be known or 
may not be specified. Instead, the goal is to identify all 
possible inferences that arise from knowledge of the 
collection of documents $\mathcal{C}$ and the reference collection $\mathcal{R}$. 
A simple variation of the algorithm given in 3.2 handles 
this case. In step 3 of the inference detection phase, we 
record all inferences instead of only inferences that in- 
volve keywords in K*. Note that this is equivalent to 
assuming that the set K* of sensitive knowledge consists 
of all knowledge. The algorithm may also track the 
number of occurrences of each inference, so that the list $\mathcal{L}$ can 
be sorted from most to least frequent inference. 
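
As a small illustration of the occurrence tracking just described, the following Python sketch counts how often each inferred keyword is returned across queries and sorts the result by frequency. The function name `rank_inferences` and the choice to count each keyword at most once per query are assumptions made for this example.

```python
from collections import Counter


def rank_inferences(keywords_per_query):
    """Count, over all queries, how many queries returned each keyword,
    and list the keywords from most to least frequent."""
    counts = Counter()
    for keywords in keywords_per_query:
        counts.update(set(keywords))  # count a keyword once per query
    return counts.most_common()


ranking = rank_inferences([["hiv", "herpes"], ["hiv"], ["hiv", "std"]])
print(ranking[0])
```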


ALTERNATIVE REPRESENTATION OF SENSITIVE 
KNOWLEDGE. The algorithm of section 3.2 assumes 
that the sensitive knowledge K* is given as a set of 
keywords. Other representations of sensitive knowledge 
are possible. In some applications, for example, sensitive 
knowledge may consist of a topic (e.g. alcoholism, 
or sexually transmitted diseases) instead of a list of 
keywords. To handle this case, we need a pre-processing 
step which converts a sensitive topic into a list of 
sensitive keywords. One way of doing so is to issue a 
search query for documents in the reference collection 
R that contain the sensitive topic, then use TF.IDF 
to extract from these documents an expanded set of 
sensitive keywords. 


4 Example Applications 


This section describes a wide array of potential applica- 
tions for Web-based inference detection. All these appli- 
cations are based on the fundamental algorithm of sec- 
tion 3. The first two applications are the subjects of the 
experiments described in detail in section 5. Experiment- 
ing with other applications will be the subject of future 
work. 


REDACTION OF MEDICAL RECORDS. Medical records 
are often released to third parties such as insurance com- 
panies, research institutions or legal counsel in the case 
of malpractice lawsuits. State and federal legislation 
mandates the redaction of sensitive information from 
medical records prior to release. For example, all ref- 
erences to drugs and alcohol, mental health and HIV sta- 
tus must typically be redacted. This redaction task is far 
more complex than it may initially appear. Extensive and 
up-to-date knowledge of diseases and drugs is required to 
detect all clues and combinations of clues that may allow 
for inference of sensitive information. Since this medical 
information is readily available on public websites, the 
process of redacting sensitive information from medical 
records can be partially automated with Web-based infer- 
ence control. Section 5.3 reports on our experiments with 
Web-based inference detection for medical redaction. 


PRESERVING INDIVIDUAL ANONYMITY. Intelligence 
and other governmental agencies are often forced by law 
(such as the Freedom of Information Act) to release pub- 
licly documents that pertain to a particular individual or 
group of individuals. To protect the privacy of those con- 
cerned, the documents must be released in a form that 
does not allow for unique identification. This problem is 
notoriously difficult, because seemingly innocuous infor- 
mation may allow for unique identification, as illustrated 
by the poorly redacted Osama Bin Laden biography [8] 
discussed in the introduction. Web-based inference con- 
trol is perfectly suited to the detection of indirect infer- 
ences based on publicly available data. Our tools can 
be used to determine how much information can be re- 
leased about a person, entity or event while preserving k- 
anonymity, i.e. ensuring that it remains hidden in a group 
of like-entities of size at least k, and cannot be identified 
any more precisely within the group. Section 5.2 reports 
on our experiments with Web-based inference detection 
for preserving individual anonymity. 


FORMULATION OF REDACTION RULES. Our Web-based 
inference detection tools can also be used to pre-compute 
a set of redaction rules that is later applied to a collection 
of private documents. For a large collection of private 
documents, pre-computing redaction rules may be more 
efficient than using Web-based inference detection to an- 
alyze each and every document. In 1995 for example, 
executive order 12958 mandated the declassification of 
large amounts of government data [9] (hundreds of mil- 
lions of pages). Sensitive portions of documents were to 
be redacted prior to declassification. The redaction rules 
were exceedingly complex and formulating them was 
reportedly nearly as time-consuming as applying them. 
Web-based inference detection is an appealing approach 
to automatically expand a small set of seed redaction 
rules. For example, assuming that the keyword “mis- 
sile” is sensitive, web-based inference detection could 
automatically retrieve other keywords related to missiles 
(e.g. “guidance system”, “ballistics”, “solid fuel”) and 
add them to the redaction rule. 


PUBLIC IMAGE CONTROL. This application considers 
the problem of verifying that a document conforms to 
the intentions of its author, and does not accidentally re- 
veal private information or information that could eas- 
ily be misinterpreted or understood in the wrong con- 
text. This application, unlike others, does not assume 
that the set of unwanted inferences is known or explic- 
itly defined. Instead, the goal of this application is to 
design a broad, general-purpose tool that helps contex- 
tualize information and may draw an author’s attention 
to a broad array of potentially unwanted inferences. For 
example, Web-based inference detection could alert the 
author of a blog to the fact that a particular posting con- 
tains a combination of keywords that will make the blog 
appear prominently in the results of some search query. 
This problem is related to other approaches to public im- 
age management, such as [13, 31]. Few technical details 
have been published about these other approaches, but 
they do not appear focused on inference detection and 
control. 


LEAK DETECTION. This application helps a data owner 
avoid accidental releases of information that was not pre- 
viously public. In this application of Web-based inference 
control, the set of sensitive knowledge $K^*$ consists 
of all information that was not previously public. In other 
words, the release of private data should not add anything 
to public knowledge. This application may have helped 
prevent, for example, a recent incident in which Google 
accidentally released confidential financial information 
in the notes of a PowerPoint presentation distributed to 
financial analysts [22]. 


5 Experiments 


Our experiments focus on exploring the first two pri- 
vacy monitor applications of section 4: redaction of med- 
ical records and preserving individual anonymity. In 
testing these ideas, we faced two main challenges that 
constrained our experimental design. First, and most 
challenging, was designing relevant experiments that we 
could execute given available data. The second, more 
pragmatic, challenge was getting the right tools in place 
and executing the experiments in a time-efficient manner. 
We describe each of these challenges, and our approach 
to meeting them, in more detail below. 


5.1 Experimental Design Challenges and 
Tools 


Ideally, our idea of Web-based inference detection would 
be tested on authentic documents for which privacy is a 
chief concern. For example, a corpus of medical records 
being prepared for release in response to a subpoena 
would be ideal for evaluating the ability of our tech- 
niques to identify sensitive topics. However, such a cor- 
pus is hard to come by for obvious reasons. Similarly, 
a collection of anonymous blogs would be ideal for test- 
ing the ability of our techniques to identify individuals, 
but such blogs are hard to locate efficiently. Indeed, the 
excitement over the recently released AOL search data, 
as illustrated by the quick appearance of tools for min- 
ing the data (see, for example, [44, 4]), demonstrates the 
widespread difficulty in finding data appropriate for vet- 
ting data mining technologies, of which our inference de- 
tection technology is an example.* 

Given the difficulties of finding unequivocally sensi- 
tive data on which to test our algorithms, we used in- 
stead publicly available information about an individual, 
which we anonymized by removing the individual’s first 
and last names. In most cases, the public information 
about the individual, thus anonymized, appeared to be a 
decent substitute for text that the individual might have 
authored on their blog or Web page. 

All of our experiments rely on Java code we wrote 
for extracting text from html, on calculation of an 
extended form of TF.IDF (see definition below) for 
identifying keywords in documents and on the Google SOAP 
search API [18] for making Web queries based on those 
keywords. 

Our code for extracting text from html uses standard 
techniques for removing html tags. Because our 
experiments involved repeated extractions from similarly 
formatted html pages (e.g. Wikipedia biographies), it was 
most expedient to write our own code, customized for 
those pages, rather than retrofitting existing text 
extraction code such as is available in [3]. 

As mentioned above, in order to determine if a word 
is a keyword we use the well known TF.IDF metric (see, 
for example, [28]). The TF.IDF “rank” of a word in a 
document is defined with respect to a corpus, C. We 
state the definition next. 


Definition 1 Let D be a document that contains the 
word W and is part of a corpus of documents, C. The 
term frequency (TF) of W with respect to D is the 
number of times W occurs in D. The document frequency 
(DF) of W with respect to the corpus, C, is the total 
number of documents in C that contain the keyword W. The 
TF.IDF value associated with W is the ratio: TF/DF. 
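
Definition 1 translates directly into code. The following Python sketch is illustrative only (our experimental code is written in Java); the function name `tf_df_scores` and the whitespace tokenization are assumptions, and the document is expected to be a member of the corpus so that every DF is at least 1.

```python
from collections import Counter


def tf_df_scores(document, corpus):
    """Return {word: TF/DF} for every word of `document`.

    TF is the number of times the word occurs in `document`; DF is the
    number of documents in `corpus` that contain the word. `document`
    is assumed to belong to `corpus`, so DF >= 1.
    """
    tf = Counter(document.lower().split())
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc.lower().split())
        scores[word] = count / df
    return scores


corpus = ["the cat sat", "the dog sat", "the cat the cat"]
scores = tf_df_scores(corpus[2], corpus)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy corpus the word "cat" outranks "the": both occur twice in the third document, but "the" appears in all three documents while "cat" appears in only two.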


Our code implements a variant of TF.IDF in which we 
first use the British National Corpus (BNC) [27] to stem 
lexical tokens (e.g. the tokens “accuse”, “accused”, 
“accuses” and “accusing” would all be mapped to the stem 
“accuse”). We then use the BNC again to associate with 
each token the DF of the corresponding stem (i.e. 
“accuse” in the earlier example). 

As with text extraction from html, there are open 
source (and commercial) offerings for calculating 
TF.IDF based on a reference corpus. We did not, 
however, have a reference corpus on which to base our 
calculations, and thus opted to write our own code to 
compute TF.IDF based on the DF values reported in the BNC 
(which is an excellent model for the English language as 
a whole, and thus presumably also for text found on the 
Web). 

Our final challenge was experimental run-time. 
Although we did not invest time optimizing our text 
extraction code for speed, it nevertheless proved 
remarkably efficient in comparison with the time needed to 
execute Google queries and download Web pages. In 
addition, Google states that they place a constraint of 1,000 
queries per day for each registered developer on the 
Google SOAP Search API service [18]. This constraint 
required us to amass enough Google registrations to 
ensure our experiments could run uninterrupted; in 
our case, given the varying running times of our 
experiments, 17 registrations proved enough. The delay caused 
by query execution and Web page download caused us to 
modify our algorithms to do a less thorough search for 
inferences than we had originally intended. These 
modifications almost certainly cause our algorithms to 
generate an incomplete set of inferences. However, it is also 
important to note that despite our efforts, our results 
contain some links that should have been discarded because 
they either don’t represent new information (e.g. scrapes 
of the site from which we extracted keywords) or don’t 
connect the keywords in our query to the sensitive words 
in a meaningful way (e.g. an online dictionary covering a 
broad swath of the English language). Hence, it is 
possible to improve upon our results by changing the 
parameters of our basic experiments to either do more filtering 
of the query results or analyze more of the query results 
and require that a majority contain the sensitive word(s). 

We describe each experiment in detail below. 


5.2 Web-based De-anonymization 


As discussed in section 4, one of our goals is to 
demonstrate how keyword extraction can be used to warn the 
end-user of impending identification. Our inference 
detection technology accomplishes this by constantly 
amassing keywords from online content proposed for 
posting by the user (e.g. blog entries) and issuing Web 
queries based on those keywords. The user is alerted 
when the hits returned by those queries contain their name, 
and thus is warned about the risk of posting the content. 

We simulated this setting with Wikipedia biographies 
standing in for user-authored content. We removed 
the biography subject’s name from the biography and 
viewed the personal content in the biography as being 
a condensed version of the information an individual 
might reveal over many posts to their blog, for example. 
From these “anonymized” biographies we extracted key- 
words. Subsets of keywords formed queries to Google. 
A portion of the returned hits were then searched for the 
biography subject’s name and a flag was raised when a 
hit that was not a Wikipedia page contained a mention 
of the biography’s subject. For efficiency reasons, we 
limited the portion and number of Web pages that were 
examined. In more detail, our experiment consists of the 
following steps: 


Input: a Wikipedia biography, B: 


1. Extract the subject, $N$, of the biography, $B$, and 
parse $N$ into a first name, $N_1$, optional middle name 
or middle initial, $N_1'$, and a last name, $N_2$ (where 
$N_1'$ is empty if a name in that position is not given 
in the biography).> 

2. Extract the top 20 keywords from the Wikipedia 
biography, $B$, forming the set, $S_B$, through the 
following steps: 


(a) Extract the text from the html. 

(b) Calculate the enhanced TF.IDF ranking of 
each word in the extracted text (section 5.1). 
If present, remove $N_1$, $N_1'$ and $N_2$ from this 
list, and select the top 20 words from the 
remaining text as the ordered set, $S_B$. 


3. For $x = 20, 19, \ldots, 1$, issue a Google query on the 
top $x$ keywords in $S_B$. Denote this query by $Q_x$. 
For example, if $W_1, W_2, W_3$ are the top 3 keywords, 
the Google query $Q_3$ is: $W_1\ W_2\ W_3$, with no 
additional punctuation. Let $H_x$ be the set of hits 
returned by issuing query $Q_x$ to Google with the 
restrictions that the hits consist solely of html or text® 
and that no hits from the en.wikipedia.org Web site 
be returned. 

4. Let $H_{x,1}, H_{x,2}, H_{x,3} \in H_x$ be the first, second and 
third hits (respectively) resulting from query $Q_x$.’ 
For $x = 20, 19, \ldots, 1$, determine if $H_{x,1}$, $H_{x,2}$ and 
$H_{x,3}$ contain references to subject, $N$, by searching 
for contiguous occurrences of $N_1$, $N_1'$ and $N_2$ 
(meaning, no words appear in between the words in 
a name) within the first 5000 lines of html in each of 
$H_{x,1}$, $H_{x,2}$ and $H_{x,3}$. Record any such occurrences. 


Output: $S_B$, each query $Q_x$ for which $N_1$, $N_1'$ and 
$N_2$ appear contiguously in at least one of the three examined 
hits, and the url of the particular hit(s). 


Figure 1: Using 20 keywords per person, extracted from each resident’s Wikipedia biography, the percentage of individuals who were identifiable 
based on x keywords or fewer for x = 1, ..., 20. The graph on the left (titled “Identifiability of California individuals with Wikipedia 
biographies”) shows results for the 234 biographies of California residents in Wikipedia 
and the graph on the right shows the results for the 106 biographies of Illinois residents in Wikipedia. 


We ran this test on the 234 biographies of California 
residents, and the 106 biographies of Illinois residents 
contained in Wikipedia. The results for both states are 
shown in Figure 1 and are very similar. In each case, 10 
or fewer keywords (extracted from the Wikipedia biog- 
raphy) suffice to identify almost all the individuals. Note 
that statistics in Figure 1 are based solely on the output 
of the code, with no human review. 
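
Steps 3 and 4 of the procedure described earlier in this section can be sketched as follows. This Python fragment is an illustration under stated assumptions rather than our experimental Java code: `search` is a hypothetical callable that returns a ranked list of (url, text) pairs for a query (it stands in for the Google SOAP Search API, whose interface is not reproduced here), name matching is case-insensitive, and a name that wraps across a line break is treated as contiguous.

```python
import re


def build_queries(keywords):
    """Q_x for x = len(keywords), ..., 1: the top x keywords joined by
    spaces, with no additional punctuation."""
    return {x: " ".join(keywords[:x]) for x in range(len(keywords), 0, -1)}


def mentions_name(page_text, name_parts, max_lines=5000):
    """True if the non-empty name parts occur contiguously within the
    first `max_lines` lines of the page."""
    head = " ".join(page_text.splitlines()[:max_lines])
    parts = [re.escape(p) for p in name_parts if p]
    pattern = r"\b" + r"\s+".join(parts) + r"\b"
    return re.search(pattern, head, flags=re.IGNORECASE) is not None


def flag_queries(keywords, name_parts, search, hits_per_query=3):
    """Return (x, query, url) for every examined hit that names the subject."""
    flagged = []
    for x, query in build_queries(keywords).items():
        for url, text in search(query)[:hits_per_query]:
            if mentions_name(text, name_parts):
                flagged.append((x, query, url))
    return flagged


def fake_search(query):
    # Hypothetical stand-in for a Web search, for demonstration only.
    if "goldman" in query:
        return [("http://example.org/a", "Story about O. J.\nSimpson trial")]
    return [("http://example.org/b", "football scores")]


flags = flag_queries(["nfl", "nicole", "goldman"],
                     ["O.", "J.", "Simpson"], fake_search)
print(flags)
```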

We also include example results (keywords, url, 
biography subject) in Figure 2. These results illustrate that 
the associations a person has may be as useful for 
identifying them as their personal attributes. To highlight one 
example from the figure, 50% of the first page of hits 
returned from the Google query “nfl nicole goldman 
francisco pro” are about O. J. Simpson (including the top 3 
hits), but there is no reference to O. J. Simpson in any of 
the first page of hits returned by the query “nfl francisco 
pro”. Hence, the association of O. J. Simpson with his 
ex-wife (Nicole) and her friend (Goldman) is very 
useful to identifying him in the pool of professional 
football players who once were members of the San 
Francisco 49ers. 


PERFORMANCE. In our initial studies, there was wide 
variation, from a few minutes to over an hour, in the total 
time it took to process a single biography, B, depending 
on the length of the Web pages returned and the num- 
ber of hits. Hence, in order to efficiently process a suf- 
ficiently large number of biographies we restricted the 
code to only examining the first 5000 lines of html in the 
returned hits from a given query, and to only search the 
first 3 hits returned from any given query. With these 
restrictions, each biography took around 20 minutes to 
process, with some variation due to differences in 
biography length. In total, our California experiments took 
around 78 hours and our Illinois experiments took about 
35 hours. Our experimental code does not keep track 
of the number of queries issued per registration. Tracking 
this count may yield better performance, because the code 
switched between registrations only upon receiving 
a Google SOAP error, which caused some delay. 


Our code was not optimized for performance and 
improvements are certainly possible. In particular, our main 
slowdown came from the text extraction step. One 
improvement would be to cache Web sites to avoid repeat 
extractions. 

Keywords | URL of Top Hit | Name of Person
--- | --- | ---
campaigned soviets | http://www.utexas.edu/features/archive/2004/election_policy.html | Ronald Reagan
defense contra reagan | http://www.pbs.org/wgbh/amex/reagan/peopleevents/pande08.html | Caspar Weinberger
reagan attorney edit pornography | http://www.sourcewatch.org/index.php?title=Edwin_Meese_III | Edwin Meese
nfl nicole goldman francisco pro | http://www.brainyhistory.com/years/1997.html | O. J. Simpson
kung fu actors | http://www.amazon.com/Kung-Fu-Complete-Second-Season/dp/B0006BAWYM | David Carradine
medals medal raid honor aviation | http://www.voicenet.com/Ipadilla/pearl.html | Jimmy Doolittle
fables chicago indiana | http://www.indianahistory.org/pop_hist/people/ade.html | George Ade
wisconsin illinois chicago architect designed | http://www.greatbuildings.com/architects/Frank_Lloyd_Wright.html | Frank Lloyd Wright

Figure 2: Excerpts from our de-anonymization experiments. Each row lists keywords extracted from the Wikipedia biography of an individual 
(categorized under “California” or “Illinois”), a hit returned by a Google query on those keywords that is one of the top three hits returned and 
contains the individual’s name, and the name of the individual. 


5.3 Web-based Sensitive Topic Detection 


Another application of Web-based inference detection is 
the redaction of medical records. As discussed earlier, it 
is common practice to redact all information about 
diseases such as HIV/AIDS, mental illness, and drug and 
alcohol abuse, prior to releasing medical records to a 
third party (e.g. a judge in malpractice 
litigation). Implementing such protections today relies on 
the thoroughness of the redaction practitioner to keep 
abreast of all the medications, physician names, 
diagnoses and symptoms that might be associated with such 
conditions and practices. Web-based inference detection 
can be used to improve the thoroughness of this task by 
automating the process of identifying the keywords 
allowing such conditions to be inferred. 


To demonstrate how our algorithm can be used in this 
application, our experiments take as input a page that is 
viewed as authoritative about a certain disease. In our 
experiments, we used Wikipedia to supply pages for al- 
coholism and sexually transmitted diseases (STDs). The 
text is then extracted from the html, and keywords are 
identified. To identify keywords that might allow the 
associated disease to be inferred we then issued Google 
queries on subsets of keywords and examined the top hit 
for references to the associated disease. In general, we 
counted as a reference any mention of the associated dis- 
ease. The one exception to this rule is that we filtered out 
some medical term sites since such sites list unrelated 
medical terms together (for indexing purposes) and we 
didn’t want such lists to trigger inference results. 

In the event that such a reference was found we 
recorded those keywords as being potentially 
inference-enabling. In practice, a redaction practitioner might then 
use this output to decide what words to redact from 
medical records before they are released in order to preserve 
the privacy of the patient. 

To gain some confidence in our approach we also used 
a collection of general medical terms as a “control” and 
followed the same algorithm. That is, we made Google 
queries using these medical terms and looked for refer- 
ences to a sensitive disease (STDs and alcoholism) in the 
returned links. The purpose of this process was to see 
if the results would differ from those obtained with key- 
words from the Wikipedia pages about STDs and alco- 
holism. We expected a distinct difference because the 
Wikipedia pages should yield keywords more relevant to 
STDs and alcoholism, and indeed the results indicate that 
is the case. 

The following describes our experiment in more de- 
tail. 


1. Input: An ordered set of sensitive words, 
$K^* = \{v_1, \ldots, v_b\}$, for some positive integer $b$, and a 
page, $B$. $B$ is either the Wikipedia page for 
alcoholism [40], the Wikipedia page for sexually 
transmitted diseases (STDs) [41] or a “control” page of 
general medical terms. 


(a) If $B$ is a Wikipedia page, extract the top 
30 keywords from $B$, forming the set $S_B$, 
through the following steps: 

i. Extract text from html. 

ii. Calculate the enhanced TF.IDF ranking 
of each word in the extracted text 
(section 3). Select the top 30 words as the 
ordered set, $S_B = \{W_1, W_2, \ldots, W_{30}\}$. 

(b) If $B$ is a medical terms page, extract the terms 
using code customized for that Web site and 
let $S_B = \{W_1, W_2, \ldots, W_{30}\}$ be a subset 
of 30 terms from that list, where the 
selection method varies with each run of the 
experiment (see the results discussion below for the 
specifics). 

(c) For each pair of words $\{W_i, W_j\} \subseteq S_B$, let 
$Q_{i,j}$ be the query consisting of just those two 
words with no additional punctuation and the 
restriction that no pages from the domain of 
source page $B$ be returned, and that all 
returned pages be text or html (to avoid parsing 
difficulties). Let $H_{i,j}$ denote the first hit 
returned after issuing query $Q_{i,j}$ to Google, after 
known medical terms Web sites were removed 
from the Google results®. 

(d) For all $i, j \in \{1, \ldots, 30\}$, $i \neq j$, and for 
$\ell \in \{1, \ldots, b\}$, search for the string $v_\ell \in K^*$ 
in the first 5000 lines of $H_{i,j}$. If $v_\ell$ is found, 
record $v_\ell$, $W_i$, $W_j$ and $H_{i,j}$ and discontinue the 
search. 


2. Output: All triples $(v_\ell, Q_{i,j}, H_{i,j})$ found in step 1, 
where $v_\ell$ is in the first 5000 lines of $H_{i,j}$. 
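
The pairwise procedure of steps (c) and (d) can be sketched as follows. This is a minimal Python illustration, not our experimental code: `top_hit` is a hypothetical callable that returns the (url, text) of the first admissible hit for a query, or None if there is none, and the sensitive words are matched as case-insensitive substrings, which is an assumption of this sketch.

```python
from itertools import combinations


def detect_topic_inferences(keywords, sensitive_words, top_hit,
                            max_lines=5000):
    """Return (sensitive word, query, url) triples, one per keyword pair
    whose top hit mentions a sensitive word in its first `max_lines` lines."""
    results = []
    for w_i, w_j in combinations(keywords, 2):
        query = f"{w_i} {w_j}"            # two words, no extra punctuation
        hit = top_hit(query)
        if hit is None:
            continue
        url, text = hit
        head = "\n".join(text.splitlines()[:max_lines]).lower()
        for v in sensitive_words:         # ordered set; stop at first match
            if v.lower() in head:
                results.append((v, query, url))
                break
    return results


def fake_top_hit(query):
    # Hypothetical stand-in for a Web search, for demonstration only.
    if "naltrexone" in query:
        return ("http://example.org/n", "Used to treat Alcoholism")
    return None


triples = detect_topic_inferences(["naltrexone", "acamprosate", "dose"],
                                  ["alcoholism"], fake_top_hit)
print(triples)
```

With 30 keywords the loop issues one query per unordered pair, i.e. 435 queries per input page.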


RESULTS FOR STD EXPERIMENTS. We ran the above 
test on the Wikipedia page about STDs [41], B, and a 
selected set, B’, of 30 keywords from the medical term 
index [29]. The set B’ was selected by starting at the 
49th entry in the medical term index and selecting every 
400th word in order to approximate a random selection 
of medical terms. As expected, keyword pairs from input 
B generated far more hits for STDs (306/435 > 70%) 
than keyword pairs from B’ (108/435 < 25%). The 
results are summarized in figure 3. 
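
The pair counts above follow from simple arithmetic, which the short Python check below reproduces: 30 keywords yield 435 unordered pairs, and the two hit counts correspond to the percentages quoted. The variable names are chosen for this illustration only.

```python
from math import comb

pairs = comb(30, 2)                          # unordered pairs of 30 keywords
std_rate = round(100 * 306 / pairs, 2)       # pairs from the Wikipedia page B
control_rate = round(100 * 108 / pairs, 2)   # pairs from the control set B'
print(pairs, std_rate, control_rate)
```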


RESULTS FOR ALCOHOLISM EXPERIMENTS. We ran 
the above test on the Wikipedia page about alcoholism 
[40], B, and a selected set, B’, of 30 keywords from the 
medical term index [29]. For the run analyzed in 
Figure 4, the set B’ was selected by starting at the 52nd entry 
in the medical term index and selecting every 100th word 
until 30 were accumulated in order to approximate a 
random selection of medical terms. As expected, keyword 
pairs from input B generated far more hits for alcoholism 
(47.82%) than B’ (9.43%). In addition, we manually 
reviewed the URLs that yielded a hit on a sensitive word 
$v \in K^*$ (the sensitive alcoholism keywords) for a 
seemingly innocuous pair of keywords. These results are 
summarized in figure 4. 


APPLYING THE RESULTS. When redacting medical 
records, a redaction practitioner might use the results in 
figures 3 and 4 to choose content to redact. For 
example, figure 4 indicates the medications naltrexone and 
acamprosate should be removed due to their 
popularity as alcoholism treatments. The words identified as 
STD-inference enabling are far more ambiguous (e.g. 
“transmit”, “infected”). However, for some individuals 
the very fact that even general terms are frequently 
associated with sensitive diseases may be enough to 
justify redaction (e.g. a politician may desire the removal 
of any “red flag” words). In general though, we think 
a redaction practitioner could defensibly choose not to 
redact such general terms given their 
association with other, less sensitive, diseases. This emphasizes 
that our techniques support semi-automation, but not full 
automation, of the redaction process. 


PERFORMANCE. Amortizing the cost of text extraction 
from the Wikipedia source page over all the queries, de- 
termining if each keyword pair yielded a top hit contain- 
ing a sensitive word took approximately 150 seconds. 
Hence, each of the experiments in figures 3 and 4 took 
around 6 hours, since 435 pairs from the Wikipedia page 
were tested along with 435 pairs from the “control” set 
of keywords. 

As in the de-anonymization experiments, our main 
time cost was due to the process of text extraction from 
html. For these experiments caching is likely to signifi- 
cantly improve performance as many of the medical re- 
source sites were visited multiple times. 


6 Use Scenario: Iterative Redaction 


As mentioned in sections 1 and 4, the process of sani- 
tizing documents by removing obviously identifying in- 
formation like names and social security numbers can 
be improved by using Web-based inference detection to 
identify pieces of seemingly innocuous information that 
can be used to make sensitive inferences. To illustrate this 
idea, we return to the poorly redacted FBI document in 
the left-hand side of figure 6. Algorithms like those pre- 
sented in sections 3.2 and 5 can be used to identify sets 
of keywords that allow for undesired inferences. Some 
or all of those keywords can then be redacted to improve 
the sanitization process. 

We emphasize that the strategy for redacting based 
upon the inferences detected by our algorithms is a re- 
search problem that is not addressed by this paper. In- 
deed many strategies are possible. For example, one 
might redact the minimum set of words (in which case, 
the redactor seeks to find a minimum set cover for the 
collection of sets output by the inference detection algo- 
rithm). Alternatively, the redactor might be biased in fa- 
vor of redacting certain parts of speech (e.g. nouns rather 
than verbs) to enhance readability of the redacted docu- 
ment. 
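
As an illustration of the first strategy, the following Python sketch greedily selects words to redact so that every detected keyword set loses at least one word. It is a heuristic under stated assumptions, not a method proposed in this paper: the function name `greedy_redaction` is hypothetical, the problem is phrased as choosing words that "hit" every keyword set, and the greedy choice approximates the minimum but does not guarantee it.

```python
def greedy_redaction(inference_sets):
    """Pick words to redact so that each keyword set loses at least one
    word. Greedy heuristic: repeatedly redact the word that appears in
    the most remaining sets (ties broken alphabetically)."""
    remaining = [set(s) for s in inference_sets if s]
    redacted = []
    while remaining:
        counts = {}
        for s in remaining:
            for w in s:
                counts[w] = counts.get(w, 0) + 1
        best = min(counts, key=lambda w: (-counts[w], w))
        redacted.append(best)
        # Sets containing the redacted word no longer enable an inference.
        remaining = [s for s in remaining if best not in s]
    return redacted


to_redact = greedy_redaction([
    {"nfl", "nicole"},
    {"nicole", "goldman"},
    {"francisco", "pro"},
])
print(to_redact)
```

A bias toward particular parts of speech could be added by weighting the counts, at the cost of redacting more words.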

The type of redaction strategy that is employed may 
influence the Web-based inference detection algorithm. 
For example, if the goal is to redact the minimum set of 
words, then it is necessary to consider all possible sets 
Summary of STD Experiments

Input Web Page, B: Wikipedia STD site [41]

Extracted Keywords, S_B: transmit, sexually, transmitting, transmitted, infection, std, sti, hepatitis, infected,
infections, transmission, stis, herpes, viruses, virus, chlamydia, _⁹, stds, sexual, disease, hiv, membrane, genital,
intercourse, diseases, pmid, hpv, mucous, viral, 2006

Input Web Page, B′ (“control” page): Medical Terms Site [29]

Extracted “Control” Keywords, S_B′: Ablation, Ah-Al, Aneurysm, thoracic, Arteria femoralis, Barosinusitis,
Bone mineral density, Cancer, larynx, Chain-termination codon, Cockayne syndrome, Cranial nerve IX,
Dengue, Disorder, cephalic, ECT, Errors of metabolism, inborn, Fear of nudity, Fracture, comminuted, Gland,
thymus, Hecht-Beals syndrome, Hormone, thyroxine, Immunocompetent, Iris melanoma, Laparoscopic, Lung
reduction surgery, Medication, clot-dissolving, Mohs surgery, Nasogastric tube, Normoxia, Osteosarcoma,
PCR (polymerase chain reaction), Plan B

Sensitive Keywords, K*_STD: STD, Chancroid, Chlamydia, Donovanosis, Gonorrhea, Lymphogranuloma
venereum, Non-gonococcal urethritis, Syphilis, Cytomegalovirus, Hepatitis B, Herpes, HSV, Human
Immunodeficiency Virus, HIV, Human papillomavirus, HPV, genital warts, Molluscum, Severe acute respiratory
syndrome, SARS, Pubic lice, Scabies, crabs, Trichomoniasis, yeast infection, bacterial vaginosis, trichomonas,
mites, nongonococcal urethritis, NGU, molluscum contagiosm virus, MCV, Herpes Simplex Virus, Acquired
immunodeficiency syndrome, aids, pubic lice, HTLV, trichomonas, amebiasis, Bacterial Vaginosis,
Campylobacter Fetus, Candidiasis, Condyloma Acuminata, Enteric Infections, Genital Mycoplasmas, Genital Warts,
Giardiasis, Granuloma Inguinale, Pediculosis Pubis, Salmonella, Shingellosis, vaginitis

Percentage of words in S_B yielding a top hit containing word(s) in K*_STD: 33.33%
Percentage of word pairs in S_B yielding a top hit containing word(s) in K*_STD: 70.34%
Percentage of “control” words in S_B′ yielding a top hit containing word(s) in K*_STD: 3.33%
Percentage of “control” word pairs in S_B′ yielding a top hit containing word(s) in K*_STD: 24.83%

Example keyword pairs from S_B returning a top hit containing a word in K*_STD:¹⁰

| Keywords | URL of Top Hit | Sensitive Word in Top Hit |
|---|---|---|
| transmit, infected | http://www.rci.rutgers.edu/~insects/aids.htm | HIV |
| transmit, mucous | http://research.uiowa.edu/animal/?get=empheal | Herpes |
| transmitting, viruses | http://www.cdc.gov/hiv/resources/factsheets/transmission.htm | Hepatitis B |
| transmitted, viral | http://www.eurosurveillance.org/em/v10n02/1002-226.asp | Hepatitis B |
| transmitted, infection | http://www.plannedparenthood.org/sti/ | STD |
| transmitted, disease | http://www.epigee.org/guide/stds.html | STD |
| infection, mucous | http://www.niaid.nih.gov/factsheets/sinusitis.htm | HIV |
| infected, disease | http://www.ama-assn.org/ama/pub/category/1797.html | HIV |
| infected, viral | http://www.merck.com/mmhe/sec17/ch198/ch198a.html | Cytomegalovirus |
| infections, viral | http://www.nlm.nih.gov/medlineplus/viralinfections.html | Cytomegalovirus |
| virus, disease | http://www.mic.ki.se/Diseases/C02.html | Cytomegalovirus |

Figure 3: Summary of experiments to identify keywords enabling STD inferences.





USENIX Association 16th USENIX Security Symposium 







Summary of Alcoholism Experiments

Input Web Page, B: Wikipedia Alcoholism site [40]

Extracted Keywords, S_B: alcoholism, alcohol, drunk, alcoholic, alcoholics, naltrexone, drink, addiction,
dependence, detoxification, diagnosed, screening, drinks, moderation, abstinence, 2006, disorder, drinking,
behavior, questionnaire, cage, treatment, citation&#160, acamprosate, because, pharmacological, anonymous,
extinction, sobriety, dsm

Input Web Page, B′ (“control” page): Medical Terms Site [29]

Extracted “Control” Keywords, S_B′: ABO blood group, Alarm clock headache, Ankle-foot orthosis,
Ascending aorta, Benign lymphoreticulosis, Breast bone, Carotid-artery stenosis, Chondromalacia patellae,
Congenital, Cystic periventricular leukomalacia, Discharge, DX, Enterococcus, Familial Parkinson disease
type 5, Fondation Jean Dausset-CEPH, Giant cell pneumonia, Heart attack, Hormone, parathyroid, Impetigo,
Itching, Laughing gas, M. intercellulare, Membranous nephropathy, MRSA, Nerve palsy, laryngeal,
Oligodendrocyte, Pap Smear, Phagocytosis, Postoperative, Purpura, Henoch-Schonlein

Sensitive Keywords, K*_Alc: alcoholism, alcoholic(s), alcohol

Percentage of words in S_B yielding a top hit containing word(s) in K*_Alc: 23.33%
Percentage of word pairs in S_B yielding a top hit containing word(s) in K*_Alc: 47.82%
Percentage of “control” words in S_B′ yielding a top hit containing word(s) in K*_Alc: 0.00%
Percentage of “control” word pairs in S_B′ yielding a top hit containing word(s) in K*_Alc: 9.43%

Example word sets from S_B returning a top hit containing a word in K*_Alc:¹¹

| Keywords | URL of Top Hit |
|---|---|
| naltrexone | http://www.nlm.nih.gov/medlineplus/druginfo/medmaster/a685041.html |
| acamprosate | http://www.nlm.nih.gov/medlineplus/druginfo/medmaster/a604028.html |
| dsm, detoxification | http://www.aafp.org/afp/20050201/495.html¹² |
| dsm, detoxification, dependence | http://www.aafp.org/afp/20050201/495.html |

Figure 4: Summary of experiments to identify keywords enabling alcoholism inferences.

















| Redacted Word(s) | Example Link | Sensitivity of Word(s) |
|---|---|---|
| 50, 52, 54 | http://multimedia.belointeractive.com/attack/binladen/1004blfamily.html | Having 50 or more siblings is very characteristic of Osama Bin Laden. |
| Boston | http://www.time.com/time/magazine/article/0,9171,1000943,00.html?promoid=googlep | Many of Osama’s relatives reside in Boston.¹³ |
| magnate | http://www.outpostoffreedom.com/bin_ladin.htm | Osama’s father was a building magnate. |
| denounced, denunciation | http://www.cairnet.org/html/911statements.html | A number of groups (including Bin Laden’s family) have denounced his actions. |
| condemnation | http://www.usnews.com/usnews/politics/whispers/archive/september2001.htm | A number of groups (including Bin Laden’s family) have condemned his actions. |

Figure 5: Words redacted as a result of Web-based inference detection. Column 1 is the word or words, column 2 is a link using those words
output by the algorithm, and column 3 explains why the word(s) are sensitive.
of keywords when looking for inferences. In contrast, if
readability is an important concern, then the considered 
sets might be those favoring certain word types. 

What we discuss here is one example of using 
Web-based inference detection to improve the redaction 
process. The approach we take is influenced by readabil- 
ity and performance (i.e. speed of the redaction process) 
but is by no means an optimal approach with respect 
to either concern. We began by applying some simple 
redaction rules to the document [8]. Specifically, we re- 
moved all location references since our example in sec- 
tion 1 indicated those were important to identifying the 
biography subject, any dates near September 11, 2001, 
which is clearly a memorable date, and finally, all cita- 
tion titles since when paired with the associated publi- 
cation, these enable the citation articles to be easily re- 
trieved. The resulting redacted document is depicted in 
figure 6, where grey rectangles indicate the redaction re- 
sulting from the rules just described. 

Our subsequent redaction proceeded iteratively. At 
each stage, we extracted the text from the current doc- 
ument, calculated the keywords ordered by the TF.IDF 
metric and searched for inferences drawn from subsets 
of a specified number of the top keywords. We then eval- 
uated the output of the algorithm by checking to ensure 
the produced links did indeed reflect identifying infer- 
ences. If a link did not use all the queried keywords in a 
discussion about Osama Bin Laden then it was deemed 
invalid. A common source of invalid links was news article
titles printed in the sidebar of the link that did not
make use of the keywords found in the main body. For
example, the query “condone citing prestigious” yields
the top hit [6] (a humor site) because a sidebar links to
an article with “Osama” in the title; however, none of the
keywords are used in the description of that article.¹⁴

We incorporated manual review of the links because 
the current form of our algorithms involves too little con- 
tent analysis to provide confidence that a returned link 
reflects a strong connection between the associated keywords
and Osama Bin Laden. In addition, given the high-security
nature of most redaction settings, it is unlikely
that a purely automated process will ever be accepted.

For those inferences that were found valid, we made 
redactions to prevent such inferences and repeated this 
process for the newly redacted document. The following 
makes the steps we followed precise. 


1. Dates near September 11, 2001, titles of all citations 
and location names were removed from the biogra- 
phy [8]. 

2. For i = 2, ..., 5:

(a) We executed Google queries for each i-tuple
in the top n_i keywords in the biography. The
n_i values were chosen based on performance
constraints as described in section 5.¹⁵ The
(i, n_i) values were: (2, 50), (3, 20), (4, 15)
and (5, 13). We concluded with 5-tuples because
no valid inferences were found for that
run of the algorithm, and only 7% of the links
returned by the algorithm run for (i, n_i) =
(4, 15) were valid. For each (i, n_i) execution
of the algorithm we received a list of sets
of keywords that were potentially inference-enabling,
and the associated top link leading
the algorithm to make this conclusion.

(b) We reviewed the returned links to see if all the
corresponding keywords were used in a discussion
of Osama Bin Laden. If so, we made
a judgement as to which keyword or keywords
to remove to eliminate the inference while preserving
readability of the document.

(c) We incremented i and returned to step (a) with 
the current form of the redacted document. 
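
The query volume implied by the (i, n_i) values in step 2(a) can be checked with a short calculation. The sketch below assumes one query per unordered i-keyword subset of the top n_i keywords; the helper names are hypothetical and the code is not the tool used in our experiments.

```python
from itertools import combinations
from math import comb

# (i, n_i) pairs reported in step 2(a): tuple size and number of top keywords.
RUNS = [(2, 50), (3, 20), (4, 15), (5, 13)]

def query_count(i, n):
    """Number of distinct i-keyword subsets of the top n keywords."""
    return comb(n, i)

def keyword_tuples(keywords, i, n):
    """Return an iterator over every i-tuple drawn from the first n ranked keywords."""
    return combinations(keywords[:n], i)

for i, n in RUNS:
    print(f"i={i}, n_i={n}: {query_count(i, n)} queries")
```

Under that assumption the four runs issue 1225, 1140, 1365 and 1287 queries respectively, each between 1000 and 1500.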


Figure 5 lists the words that were redacted as a result 
of our Web-based inference detection algorithm. The ta- 
ble also gives an example link output by the algorithm 
that motivates the redaction and a brief explanation of 
why the word is sensitive (gained from the manual re- 
view of the link(s)). Note that while our algorithm found 
some document features to be identifying that are un- 
likely to have been covered by a generic redaction rule 
(e.g. Osama Bin Laden’s father’s attribute of being a 
building magnate) it left other, seemingly unusual, at- 
tributes (such as Osama Bin Laden potentially being one 
of 20 children). Since the Web is at best a proxy for hu- 
man knowledge, and our algorithm used the Web in a 
limited way (i.e. our analysis was limited to a few hits 
with little NLP use), it seems likely that inferences were 
missed. Hence, we emphasize that our tool is best used 
to semi-automate the redaction process. 


Finally, we note that the act of redacting information
may introduce, as well as remove, privacy problems.
For example, as noted by Vern Paxson [39], redacting
“Boston” without redacting “Globe” may allow the sensitive
term “Boston” to be inferred. Our tool suggests
“Boston” for redaction, as opposed to “Boston Globe”,
because a number of Osama Bin Laden’s relatives reside
there; however, acting on this recommendation is problematic
precisely because of the difference between the
nature of the inference and the document usage of the 
term. An improved algorithm would understand the use 
of the term within the document and use this to guide the 
redaction process. 


Our final redacted document is shown on the right-hand
side of figure 6.

Figure 6: The left picture shows the original FBI-redacted biography. The right-hand side shows the document resulting
from using the Web-based inference detection algorithm, where black rectangles represent redactions recommended
by the algorithm and grey rectangles are redactions coming from removing dates in 2001, locations and the titles of
cited articles (i.e. the grey and black rectangles are redactions made by the authors of this paper).
7 Conclusion


We have introduced the notion of using the Web to detect 
undesired inferences. Our proof-of-concept experiments 
demonstrate the power of the Web for finding the key- 
words that are likely to identify a person or topic. 

As is to be expected with an initial work, there re- 
mains a lot of room for improvement in the algorithms. 
In particular, to produce an inference detection tool ca- 
pable of functioning in real-time, as is needed in some 
applications, improvements already discussed such as 
Web caching, additional filtering of results to improve 
precision, and deeper hit analysis to improve recall, are 
needed. Another avenue for improvement is through 
deeper content analysis (i.e. beyond keyword extrac- 
tion). For example, employing a tool capable of deeper 
semantic analysis such as [15] may allow for both more 
meaningful extraction of words and phrases for generat- 
ing queries, and improved analysis of the returned hits 
for more accurate inference detection. In addition, simple
improvements to the content analysis, such as better
filtering of stop words and HTML syntax, would create
more useful keyword lists.
¹ http://www.popandpolitics.com/2005/09/06/and-lite-jazz-singers-shall-lead-the-way/,
www.popandpolitics.com/2006/10/06/our-paris/

² http://en.wikipedia.org/wiki/Madonna_and_the_gay_community,
http://gaybookreviews.info/review/2807/615,
http://www.youtube.com/results?search_type=related&search_query=madonna%20oh%20father

³ Example results from our experiments appear in section 5. Because
of the dynamic nature of the Web, issuing the same queries today
may yield somewhat different results.

⁴ The AOL data can potentially be used to demonstrate the Web’s
ability to de-anonymize ([5] may be one such example), which is one
of the goals of our algorithms; however, because our target application
is the protection of English language content, we opted not to vet our
algorithms with that data.

⁵ The vast majority of the biographies we used identified their subject
by both a first and last name with no middle name or initial. Also,
name suffixes (e.g. Jr. or annotations made by Wikipedia authors
regarding profession) were ignored.

⁶ This was done to avoid difficulties parsing non-ASCII pages.

⁷ These are the first three links that appear on the results page,
whether or not one URL is a substring of another.

⁸ Here “known site” means any site with “medterm” or “medword”
in the URL. As this is certainly not sufficient to remove all medical terms
sites, we manually reviewed the results before generating the example
keyword pairs in Figure 3.

⁹ Note this extracted non-word indicates a flaw in our text-from-HTML
extraction algorithm.

¹⁰ In a manual review of the word pairs from S_B′ yielding a top hit
containing word(s) in K*_STD, we did not find any hits using the word
pair in a meaningful way in relation to a sensitive word. Rather, the hits
generally turned out to be medical term lists.

¹¹ Since all of our sensitive words pertain to the same topic, alcoholism,
we did not record which particular sensitive word was contained
in the top hit (if any).

¹² Note this is the 4th returned hit, indicating a change in our search
strategy would improve recall.

¹³ The biography only mentions “Boston” in a citation, so this is a
conservative redaction choice.

¹⁴ Alternative metrics for validity are of course possible. For example,
a more thorough algorithm might look for a shared topic (e.g. the
events of September 11, 2001) amongst links, and retain any links
pertaining to the most popular topic as valid.

¹⁵ We tended to experience problems communicating with Google
when executing algorithm runs that exceeded 1500 queries,
hence we chose values of {n_i}_i that yielded query counts in the range
of 1000 to 1500.
Abstract


Modern smartcards, capable of sophisticated cryptogra- 
phy, provide a high assurance of tamper resistance and 
are thus commonly used in payment applications. Al- 
though extracting secrets out of smartcards requires re- 
sources beyond the means of many would-be thieves, 
the manner in which they are used can be exploited for 
fraud. Cardholders authorize financial transactions by 
presenting the card and disclosing a PIN to a terminal 
without any assurance as to the amount being charged 
or who is to be paid, and have no means of discerning 
whether the terminal is authentic or not. Even the most 
advanced smartcards cannot protect customers from be- 
ing defrauded by the simple relaying of data from one 
location to another. We describe the development of 
such an attack, and show results from live experiments 
on the UK’s EMV implementation, Chip & PIN. We dis- 
cuss previously proposed defences, and show that these 
cannot provide the required security assurances. A new 
defence based on a distance bounding protocol is de- 
scribed and implemented, which requires only modest 
alterations to current hardware and software. As far as 
we are aware, this is the first complete design and imple- 
mentation of a secure distance bounding protocol. Fu- 
ture smartcard generations could use this design to pro- 
vide cost-effective resistance to relay attacks, which are a 
genuine threat to deployed applications. We also discuss 
the security-economics impact to customers of enhanced 
authentication mechanisms. 


1 Introduction 


Authentication provides identity assurance for, and of, 
communicating parties. Relay, or wormhole, attacks allow
an adversary to impersonate a participant during an
authentication protocol by effectively extending the
intended transmission range for which the system was
designed. Relay attacks have been described since at least
1976 [13, p75] and are simple to execute as the adversary
does not need to know the details of the protocol
or break the underlying cryptography. A good example
is a relay attack on proximity door-access cards demon- 
strated by Hancke [16]. To gain access to a locked door, 
the adversary simply relays the challenges from the door 
to an authorized card, possibly some distance away, and 
sends the responses back. The only restriction on the at- 
tacker is that the signals arrive at the door and remote 
card within the allotted time, which Hancke showed to 
be sufficiently liberal. Another example is wormhole at- 
tacks on wireless networks by Hu et al. [18]. Despite the 
existence of such attacks, systems susceptible to them are 
regularly being deployed. One significant reason is that 
designers consider relay attacks to be too difficult and 
costly for attackers to deploy. Section 3 aims to show 
that relay attacks are indeed practical, using as an exam- 
ple the UK’s EMV payment system, Chip & PIN. These 
flaws are demonstrated by an implementation of the relay 
attack that has been tested on live systems. 

Once designers appreciate the risk, the next step in 
building a secure system is to develop defences. Sec- 
tion 4 describes potential countermeasures to the relay 
attack and compares their cost and effectiveness. While 
some, which depend on procedural changes, could be de- 
ployed quickly and act as an interim measure, none of the 
conventional technologies meet our requirements of ade- 
quate security at low cost. We thus propose an extension 
to the smartcard standard, based on a distance bounding 
protocol, which provides adequate resistance to the relay 
attack and requires minimal changes to smartcards. 

Section 5 describes this countermeasure and its rela- 
tionship with prior work, describes a circuit design, and 
evaluates its performance and security properties. We 
have implemented the protocol on an FPGA and shown 
it to be an effective defence against very capable ad- 
versaries. In addition, the experience of both users and 
merchants is unchanged, a significant advantage over the 
other proposals we discuss. The impact of this protocol 
on the fraud liability landscape is discussed in Section 6.

Our contributions include the description of the prac- 
ticalities of relay attacks and our confirmation that de- 
ployed systems are vulnerable to them. By designing and 
testing a prototype system for demonstrating this vulner- 
ability, we show that the attack is feasible and an eco- 
nomically viable threat. Also, we detail the design of a 
distance bounding protocol for smartcards, discuss implementation
issues and present results from both normal
operation and simulated attacks. While papers
have previously discussed distance bounding protocols,
to the best of our knowledge, this is the first time one has
been implemented in practice.


2 Background 


Contact smartcards, also known as integrated circuit 
cards (ICC), as discussed in this paper, are defined by 
ISO 7816 [19] (for brevity, our description of the spec- 
ification will be only to the detail sufficient to illustrate 
our implementation). The smartcard consists of a sheet 
of plastic with an integrated circuit, normally a special- 
ized microcontroller, mounted on the reverse of a group 
of eight contact pads. Current smartcards use only five
of these: ground, power, reset and clock are inputs supplied
by the card reader, and the fifth is a bi-directional
asynchronous serial I/O signal over which the card receives
commands and returns its response. Smartcards are designed
to operate at clock frequencies between 1 and
5 MHz, with a data rate, unless specified otherwise, of
1/372 of that frequency.
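
To make the default data rate concrete, the following sketch simply divides the clock frequency by 372. The function name is hypothetical, and the 3.5712 MHz clock is a value we picked for illustration because it divides evenly; only the 1 and 5 MHz limits and the 1/372 ratio come from the description above.

```python
def default_bit_rate(clock_hz, divider=372):
    """Default serial data rate in bits per second for a given card clock."""
    return clock_hz / divider

for clock_hz in (1_000_000, 3_571_200, 5_000_000):
    rate = default_bit_rate(clock_hz)
    print(f"{clock_hz / 1e6:.4f} MHz -> {rate:.0f} bit/s")
```

At the two ends of the clock range this gives roughly 2.7 kbit/s and 13.4 kbit/s, while the illustrative 3.5712 MHz clock gives exactly 9600 bit/s.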

Upon insertion of a smartcard, the terminal first supplies
the power and clock followed by de-assertion of reset.
The card responds with an Answer-to-Reset (ATR),
indicating which protocol options it supports, including
endianness and polarity, flow control, error correction
and data rate. All subsequent communications are initiated
by the terminal and consist of a four-byte command
header with an optional variable-length payload.
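
The command layout can be sketched as follows. The header field names (class, instruction and two parameter bytes) and the single length byte before the payload are taken from the usual ISO 7816-4 short-length convention rather than from the description above, so they should be read as assumptions of this sketch; the function name and the sample bytes are illustrative only.

```python
def build_command(cla, ins, p1, p2, payload=b""):
    """Return a command as four header bytes followed by an optional payload.

    The single length byte before the payload follows the short-length
    convention and limits the payload to 255 bytes in this sketch.
    """
    if len(payload) > 255:
        raise ValueError("payload too long for a one-byte length field")
    header = bytes([cla, ins, p1, p2])
    if not payload:
        return header
    return header + bytes([len(payload)]) + payload

# Illustrative values only: a SELECT-style command carrying a 7-byte name.
command = build_command(0x00, 0xA4, 0x04, 0x00, bytes.fromhex("A0000000031010"))
print(command.hex())
```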


2.1 Payment environment 


There are four parties in the basic payment model: the
cardholder purchasing the goods or service; the merchant
supplying the goods or service, who controls
the payment terminal; the issuer bank, which is in a contractual
relationship with the cardholder and issues their card;
and the acquirer bank, which is in a contractual relationship
with the merchant.

To initiate a transaction, the cardholder presents the 
merchant with his card and agrees to make the payment 
in exchange for goods or services. The merchant validates
that the card is authentic and that the cardholder is
authorized to use it, and sends the transaction details to
the acquirer. The acquirer requests transaction authorization
from the issuer over a payment system network (e.g.
Mastercard or Visa). If the issuer accepts the transaction,
this response is sent back to the merchant via the acquirer
and the cardholder is given the goods or service. Later, the
payment is transferred from the cardholder’s account at 
the issuer to the merchant’s account at the acquirer. 

In reality, payment systems slightly differ from this 
simplified description. For this paper’s purpose, one no- 
table difference is that the merchant may skip the step 
of contacting the acquirer to verify the transaction. For 
smaller retailers, this communication is ordinarily done 
via a telephone connection, so each authorization request 
incurs a cost. Thus, for low-risk transactions it may not 
be necessary to go online. Also, if the merchant’s ter- 
minal cannot make contact with the acquirer, due to the 
phone line being busy or other technical failure, the mer- 
chant may still decide to avoid losing the sale and never- 
theless accept the transaction. 


2.2 Smartcard applications 


State-of-the-art smartcards are capable of both symmet- 
ric and asymmetric cryptography, have several hundreds 
of KB of non-volatile tamper-resistant memory, and 
through secure operating systems may support multiple, 
mutually un-trusting, applications [3]. Although the po- 
tential applications are many, they are most commonly 
used for authentication of the holder, and more specifi- 
cally for debit and credit card payment systems, where 
less sophisticated smartcards are used. 

Smartcards have advantages in all three authorization 
processes discussed above, namely: 


Card authentication: the card was issued by an accept- 
able bank, is still valid and the account details have 
not been modified. 


Cardholder verification: the customer presenting the 
card is authorized to use it. 


Transaction authorization: the customer’s account has 
adequate funds for the transaction. 


EMV [15], named after its creators, Europay, Master- 
card and Visa, is the primary protocol for debit and credit 
card payments in Europe, and is known by a variety of 
different names in the countries where it is deployed (e.g. 
“Chip & PIN” in the UK). While the following section 
will introduce the EMV protocol, other payment systems 
are similar. 

In its non-volatile memory, the smartcard may hold ac- 
count details, cryptographic keys, a personal identifica- 
tion number (PIN) and a count of how many consecutive 
times the PIN has been incorrectly entered. 
Cards capable of asymmetric cryptography can
cryptographically sign account details under the card’s private
key to perform card authentication. The merchant’s
terminal can verify the signature with a public key which 
is stored on the card along with a certificate signed by 
the issuer whose key is, in turn, signed by the operator of 
the payment system network. This method is known as 
dynamic data authentication (DDA) or the variant, com- 
bined data authentication (CDA). 

As the merchants are not trusted with the symmetric 
keys held by the card, which would enable them to pro- 
duce forgeries, cards that are only capable of symmet- 
ric cryptography cannot be reliably authenticated offline. 
However, the card can still hold a static signature of ac- 
count details and corresponding certificate chain. The 
terminal can authenticate the card by checking this sig- 
nature, known as static data authentication (SDA), but 
the lack of freshness allows replay attacks to occur. 
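
The freshness problem can be illustrated with a toy model. The sketch below uses an HMAC purely as a stand-in for a signature so that it runs with the standard library; real SDA and DDA rely on public-key signatures and certificate chains, and every name here is hypothetical. It shows only that a static value recorded once verifies again later, whereas a response bound to a fresh terminal nonce does not.

```python
import hashlib
import hmac
import os

KEY = os.urandom(16)  # stand-in for the signing key (real cards use public-key signatures)

def sign(message):
    """Toy 'signature': an HMAC over the message."""
    return hmac.new(KEY, message, hashlib.sha256).digest()

def verify(message, tag):
    """Check a toy signature in constant time."""
    return hmac.compare_digest(sign(message), tag)

account = b"hypothetical account details"

# Static scheme: the same signed blob is presented in every transaction.
recorded_static = sign(account)              # copied once by an eavesdropper
static_replay_accepted = verify(account, recorded_static)

# Dynamic scheme: the terminal sends a fresh nonce that the card must sign.
old_nonce = os.urandom(8)
recorded_dynamic = sign(account + old_nonce)  # captured in an earlier session
new_nonce = os.urandom(8)
dynamic_replay_accepted = verify(account + new_nonce, recorded_dynamic)

print(static_replay_accepted, dynamic_replay_accepted)
```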

Cardholder verification is commonly performed by re- 
quiring that the cardholder enter their PIN into the mer- 
chant’s terminal. The PIN is sent to the card which then 
checks if there have been too many consecutive incor- 
rect guessing attempts; if not, it checks if the PIN was 
entered correctly. If the terminal or card does not sup- 
port PIN verification, or the cardholder declines to enter 
it, the merchant may allow signature verification, or in 
unattended terminal scenarios, no authentication at all. 
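
The order of the two checks performed by the card can be sketched as below. The class name, the return strings and the limit of three attempts are hypothetical choices for illustration; the description above fixes only that the consecutive-failure count is examined before the PIN itself.

```python
class PinVerifier:
    """Toy model of on-card PIN checking with a consecutive-failure limit."""

    def __init__(self, pin, max_tries=3):
        self._pin = pin
        self._max_tries = max_tries  # hypothetical limit, for illustration only
        self._failed = 0             # consecutive incorrect attempts so far

    def verify(self, candidate):
        # The try counter is checked before the PIN itself is compared.
        if self._failed >= self._max_tries:
            return "blocked"
        if candidate == self._pin:
            self._failed = 0         # a correct PIN resets the counter
            return "ok"
        self._failed += 1
        return "wrong"

card = PinVerifier("1234")
print([card.verify(guess) for guess in ("0000", "1111", "2222", "1234")])
```

In this run the fourth attempt is refused even though the PIN is correct, because three consecutive wrong guesses have already been recorded.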

The card may hold a history of transactions since it
last communicated with the issuer, and evaluate the risk
of authorizing further transactions offline; otherwise, the 
card can request online authorization. In both cases, the 
card’s symmetric keys are used to produce a transaction 
certificate that is verified by the issuer. Merchants may 
also force a transaction to be online. 


2.3 Security goals and threat model 


The full threat model of EMV incorporates risk man- 
agement protocols where the card and terminal negotiate 
different methods of authenticating cardholders and the 
conditions for online or offline verification. This decision 
is reached by considering the transaction value and type 
(cash-back or goods), the card’s record of recent offline 
transactions and both the card issuer’s and merchant’s 
risk perception. This complexity and other features of 
EMV exist to manage the reality of all parties mistrust- 
ing all others (to varying extents). These details are out- 
side the scope of the paper and are further discussed in 
the EMV specification [15, book 2]. 

Instead, we assume that the merchant, the banks and 
customers are honest. We also exclude physical attacks, 
exploits of software vulnerabilities on both the smart- 
card and terminal, as well as attacks on the underlying 
cryptography. Other weaknesses of the EMV system are
known, such as replay attacks on SDA cards as discussed
above, and fallback attacks which force use of the mag- 
netic stripe, still present on smartcards for backwards 
compatibility. These weaknesses have been covered else- 
where [1, 4] and are anticipated to be resolved by even- 
tually disabling these legacy features. 

In our scenario, the goal of the attacker is to obtain 
goods or services by charging an unwitting victim who 
thinks she is paying for something different, at an
attacker-controlled terminal.


3 Relay attacks 


Relay attacks were first described by Conway [13, p75], 
explaining how someone who does not know the rules of 
chess could beat a Grandmaster. This is possible by chal- 
lenging two Grandmasters at postal chess and relaying 
moves between them. While appearing to play a good 
game, the attacker will either win against one, or draw 
against both. Desmedt et al. [14] showed how such relay
attacks could be applied against a challenge-response
payment protocol, in the so-called “mafia fraud”.

We use the mafia-fraud scenario, illustrated in Fig- 
ure 1, where an unsuspecting restaurant patron, Alice, 
inserts her smartcard into a terminal in order to pay a 
$20 charge, which is presented to her on the display. 
The terminal looks just like any one of the numerous 
types of terminals she has used in the past. This par- 
ticular terminal, however, has had its original circuitry 
replaced by the waiter, Bob, and instead of being con- 
nected to the bank, it is connected to a laptop placed be- 
hind the counter. As Alice inserts her card into the coun- 
terfeit terminal, Bob sends a message to his accomplice, 
Carol, who is about to pay $2000 for a diamond ring 
at Dave’s jewellery shop. Carol inserts a counterfeit card 
into Dave’s terminal, which looks legitimate to Dave, but 
conceals a wire connected to a laptop in her backpack. 

Bob and Carol’s laptops are communicating wirelessly
using mobile phones or some other network. The data to
and from Dave’s terminal is relayed to the restaurant’s
counterfeit terminal such that the diamond purchasing
transaction is placed on Alice’s card. The PIN entered
by Alice is recorded by the counterfeit terminal and is
sent, via a laptop and wireless headset, to Carol, who enters
it into the genuine terminal when asked. When the
transaction is over, the crooks have paid for a diamond
ring using Alice’s money. Alice got her meal for free, but
will be surprised when her bank statement arrives.
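
The reason the relay succeeds without any knowledge of the card's secrets can be shown with a toy challenge-response model. This is a deliberately simplified sketch and not the EMV protocol: the HMAC-based response, the single verifier that stands in for whichever party checks the card, and all class names are assumptions made for illustration.

```python
import hashlib
import hmac
import os

class GenuineCard:
    """Alice's card: holds the secret key and answers challenges."""
    def __init__(self, key):
        self._key = key

    def respond(self, challenge):
        return hmac.new(self._key, challenge, hashlib.sha256).digest()

class CounterfeitCard:
    """Carol's card: holds no key and forwards each challenge over the relay."""
    def __init__(self, relay):
        self._relay = relay

    def respond(self, challenge):
        return self._relay(challenge)

class Verifier:
    """Stand-in for the party that checks the card's response."""
    def __init__(self, key):
        self._key = key

    def authenticate(self, card):
        challenge = os.urandom(8)
        expected = hmac.new(self._key, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(card.respond(challenge), expected)

key = os.urandom(16)
alice_card = GenuineCard(key)
# The relay channel is modelled as a direct call to Alice's card.
relayed_card = CounterfeitCard(relay=alice_card.respond)
keyless_card = CounterfeitCard(relay=lambda challenge: bytes(32))

verifier = Verifier(key)
print(verifier.authenticate(relayed_card), verifier.authenticate(keyless_card))
```

The relayed card is accepted because every response really does come from Alice's card, while a counterfeit with no relay behind it is rejected.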

Despite the theoretical risk being documented, EMV
is vulnerable to the relay attack, as suggested by Anderson
et al. [4]. Some believed that engineering difficulties
in deployment would make the attack too expensive, or
even impossible. The following section will show that
equipment to implement the attack is readily available,
and costs are within the expected returns of fraud.

Figure 1: The EMV relay attack. Innocent customer, Alice, pays for lunch by entering her smartcard and PIN into a
modified terminal operated by Bob. At approximately the same time, Carol enters her fake card into honest Dave’s
terminal to purchase a diamond. The transaction from Dave’s terminal is relayed wirelessly to Alice’s card with the
result of Alice unknowingly paying for Carol’s diamond.


3.1 Implementation 


This section describes the equipment we used for imple- 
menting the relay attack. We chose off-the-shelf com- 
ponents that allowed for fast development rather than 
miniaturisation or cost-effectiveness. The performance 
requirements were modest, with the only strict restriction 
being that our circuit hardware fit within the terminal. 


3.1.1 Counterfeit terminal 


Chip & PIN terminals are readily available for purchase 
online and their sale is not restricted. While some are 
as cheap as $10, our terminal was obtained for $50 from 
eBay and was ideal for our purposes due to its copious in- 
ternal space. Even if second hand terminals were not so 
readily available, a plausible counterfeit could be made 
from scratch as it is only necessary that it appears legiti- 
mate to untrained customers. 

Instead of reverse engineering the existing circuit, we 
stripped all internal hardware except for the keypad and 
LCD screen, and replaced it with a $200 Xilinx Spartan-3
small form factor, USB-controlled development board. We
also kept the original smartcard reader slot, but wired its 
connections to a $40 USB GemPC Twin reader so we 
could connect it to the laptop. The result is a terminal 
with which we can record keypad strokes, display content
on the screen and interact with the inserted smartcard.
The terminal appears and behaves just like a genuine
one to the customer even though it lacks the ability
to communicate with the bank.


3.1.2 Counterfeit card 


At the jeweller’s, Carol needs to insert a counterfeit card 
connected to her laptop, into Dave’s terminal. We took 
a genuine Chip & PIN card and ground down the resin- 
covered wire bonds that connect the chip to the back of 
the card’s pads. With the reverse of the pads exposed,
we used a soldering iron to press thin, flat wires into the
plastic, running them to the edge of the card. This resulted
in a card that looked authentic from the top side, but was
actually wired on the back side, as shown in Figure 2. The
counterfeit card was then connected through a 1.5 m ca- 
ble to a $150 Xilinx Spartan-3E FPGA Starter Kit board 
to buffer the communications and translate them between 
the ISO 7816 and RS-232 protocols. Since the FPGA is 
not 5 V tolerant, we use 390 Ω resistors on the channels
that receive data from the card. For the bi-directional 
I/O channel, we use the Maxim 1740/1 level translator, 
which costs less than $2. 


3.1.3 Controlling software 


The counterfeit terminal and card are controlled by sepa- 
rate laptops via USB and RS-232 interfaces, respectively, 
using custom software written in Python. The laptops
communicate via TCP over 802.11b wireless, although
in principle this could be GSM or another wireless protocol.
This introduces significant latency, but far less
than would be a problem, as the timing-critical operations
on the counterfeit card are performed by the FPGA with
real-time guarantees.

(a) With the exterior intact, the terminal’s original internal circuitry was replaced by
a small form factor FPGA board (left); FPGA based smartcard emulator (right) connected
to counterfeit card (front).

(b) Customer’s view of terminal. Here, it is
playing Tetris, to demonstrate that we have full
control of the display and keypad.

Figure 2: Photographs of tampered terminal and counterfeit card.

One complication of selecting an off-the-shelf USB 
smartcard reader for the counterfeit terminal is that it op- 
erates at the application protocol data unit (APDU) level 
and buffers an entire command or response before send- 
ing it to the smartcard or the PC. This increases the time 
between when the genuine terminal sends a command 
and when the response can be sent; but, as previously 
mentioned, this is well within tolerances. 

This paper only deals with the “T=0” ISO 7816 sub- 
protocol, as used by all EMV smartcards we have exam- 
ined. Here, commands are uni-directional, i.e. either the 
command or response contains a payload but not both. 
Upon receiving a command code from the genuine ter- 
minal, any associated payload will not be sent by the ter- 
minal until the card acknowledges the command. The 
counterfeit card thus cannot tell whether to request a payload
(for terminal → card commands) or send the command
code to the genuine card immediately (for card →
terminal commands).

Were the counterfeit terminal to incorporate a charac- 
ter level card reader, the partial command code could 
be sent to the genuine card and the result examined to 
determine the direction, but this is not permissible for 
APDU level transactions. Hence, the controlling soft- 
ware must be told the direction for each of the 14 com- 
mand codes. Other than this detail, the relay attack 
is protocol-agnostic and could be deployed against any 
ISO 7816 based system. 
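
The direction table that the controlling software needs can be sketched as a simple lookup keyed on the instruction (INS) byte. This is a minimal illustration, not the software described above: the names `Direction`, `INS_DIRECTION` and `payload_direction` are hypothetical, and only four widely used ISO 7816 / EMV instruction codes are listed rather than the full set of 14.

```python
from enum import Enum


class Direction(Enum):
    TERMINAL_TO_CARD = "terminal -> card"  # the command carries the payload
    CARD_TO_TERMINAL = "card -> terminal"  # the response carries the payload


# Illustrative subset only; a real table would cover every command code in use.
INS_DIRECTION = {
    0xA4: Direction.TERMINAL_TO_CARD,  # SELECT: terminal sends the application name
    0x20: Direction.TERMINAL_TO_CARD,  # VERIFY: terminal sends the PIN block
    0xB2: Direction.CARD_TO_TERMINAL,  # READ RECORD: card returns record data
    0xC0: Direction.CARD_TO_TERMINAL,  # GET RESPONSE: card returns pending data
}


def payload_direction(ins: int) -> Direction:
    """Return the payload direction for an instruction byte, or raise KeyError."""
    try:
        return INS_DIRECTION[ins]
    except KeyError:
        raise KeyError(f"no direction configured for INS 0x{ins:02X}") from None
```

With such a table, the counterfeit card can decide on receipt of the command header whether to acknowledge and wait for a payload, or to forward the header to the genuine card straight away.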


3.2 Procedure and timing 


EMV offers a large variety of options, but the generality 
of the relay attack allows our implementation to account 
for them all; for simplicity, we will describe the proce- 
dure for the common case in the UK. That is, SDA card 
authentication (only the static cryptographic signature of 
the card details is checked), online transaction autho- 
rization (the merchant will connect to the issuer to ver- 
ify that adequate funds are available) and offline plaintext 
PIN cardholder verification (the PIN entered by the card- 
holder is sent to the card, unencrypted, and the card will 
check its correctness). 

Transaction authorization is accomplished by the card 
generating an application cryptogram (AC), which is au- 
thenticated by the card’s symmetric key and incorpo- 
rates transaction details from the terminal, a card transac- 
tion counter, and whether the PIN was entered correctly. 
Thus, the issuing bank can confirm that the genuine card 
was available and the correct PIN was used. Note that 
this only requires symmetric cryptography, and so will 
work even with SDA-only cards, as issued in the UK. 

The protocol can be described in six steps: 


Initialization: The card is powered up and returns the 
ATR. Then the terminal selects one of the possible 
payment applications offered by the card. 


Read application data: The terminal requests card details
(account number, name, expiration date etc.)
and verifies the static signature.


Cardholder verification: The cardholder enters their
PIN into the merchant’s terminal and this is sent to
the card for verification. If correct, the card returns
a success code, otherwise the cardholder may try
again until the maximum number of PIN attempts
has been exceeded.


Generate AC 1: The terminal requests an authorization 
request cryptogram (ARQC) from the card, which 
is sent to the issuing bank for verification, which 
then responds with the issuer authentication data. 


External authenticate: The terminal sends the issuer 
authentication data to the card. 


Generate AC 2: The terminal asks the card for a trans- 
action certificate (TC) which the card returns to the 
terminal if, based on the issuer authentication data 
and other internal state, the transaction is approved. 
Otherwise, it returns an application authentication 
cryptogram (AAC), signifying the transaction was 
denied. The TC is recorded by the merchant to 
demonstrate that it should receive the funds. 


This flow imposes some constraints on the relay at- 
tack. Firstly, Alice must insert her card before Carol in- 
serts her counterfeit card in order for initialization and 
read application data to be performed. Secondly, Al- 
ice must enter her PIN before Carol is required to en- 
ter it into the genuine terminal. Thirdly, Alice must not 
remove her card until the Generate AC 2 stage has oc- 
curred. Thus, the two sides of the radio link must be 
synchronised, but there is significant leeway as Carol can 
stall until she receives the signal to insert her card. 
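
The three ordering constraints can be written down as a small check over event timestamps. This is only a sketch: the event names and the `relay_order_ok` helper are hypothetical and are not part of the software described in this paper.

```python
def relay_order_ok(t: dict) -> bool:
    """Check the three ordering constraints on a relayed transaction.

    `t` maps (hypothetical) event names to timestamps in seconds.
    """
    return (
        # Alice's card must be present before Carol's counterfeit card is inserted.
        t["alice_inserts_card"] < t["carol_inserts_card"]
        # Alice's PIN must be captured before Carol has to type it.
        and t["alice_enters_pin"] < t["carol_enters_pin"]
        # Alice's card must stay in place until Generate AC 2 has occurred.
        and t["generate_ac_2_done"] < t["alice_removes_card"]
    )


timeline = {
    "alice_inserts_card": 0.0,
    "carol_inserts_card": 2.0,
    "alice_enters_pin": 5.0,
    "carol_enters_pin": 8.0,
    "generate_ac_2_done": 12.0,
    "alice_removes_card": 15.0,
}
print(relay_order_ok(timeline))
```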

After that point, the counterfeit card can request extra 
time from the terminal, before sending the first response, 
by sending a null procedure byte (0x60). The counter- 
feit terminal can also delay Alice by pretending to dial- 
up the bank and waiting for authorization until Carol’s 
transaction is complete. 

All timing critical sections, such as sending the ATR 
in response to de-assertion of reset and the encod- 
ing/decoding of bytes sent on the I/O, are implemented 
on the FPGA to ensure a fast enough response. There 
are wide timing margins between the command and re- 
sponse, so this is managed in software. 


3.3 Results 


We tested our relay setup with a number of different 
smartcard readers in order to test its robustness. Firstly, 
we used a VASCO Chip Authentication Program (CAP) 
reader (a similar device, but manufactured by Gemalto, is 
marketed by the UK bank Barclays as PINsentry). This
is a handheld one-time-password generator for use in on-
line banking, and implements a subset of the EMV pro- 
tocol. Specifically, it performs cardholder verification 
by checking the PIN and requests an application cryp- 
togram, which may be validated online. Our relay setup 
was able to reliably complete transactions, even when 
we introduced an extra three seconds of latency between 
command and response. While the attack we describe 
in most detail uses the counterfeit card in a retail out- 
let, a fraudster could equally use a CAP reader to access 
the victim’s online banking. This assumes that the PIN 
used for CAP is the same as for retail transactions and 
the criminal knows all other login credentials. 

The CAP reader uses a 1 MHz clock to decrease power 
consumption, but at the cost of slower transactions. We 
also tested our relay setup with a GemPC Twin reader, 
which operates at a 4 MHz frequency. The card reader 
was controlled by our own software, which simulates 
a Chip & PIN transaction. Here, the relay device also 
worked without any problems and results were identical 
to when the card was connected directly to the reader. 

Finally, we developed a portable version of the equip- 
ment, and took this to a merchant with a live Chip & PIN 
terminal. With the consent of the merchant and card- 
holder, we placed a transaction with our counterfeit card 
in the genuine terminal, and the cardholder’s card in the 
counterfeit terminal. In addition to the commands and re- 
sponses being relayed, the counterfeit terminal was con- 
nected to a laptop which, through voice-synthesis soft- 
ware, read out the PIN to our “Carol”. The transaction 
was completed successfully. One such demonstration of 
our equipment was shown on the UK consumer rights 
programme BBC Watchdog on 6th February 2007. 


3.4 Further applications and feasibility 


The relay attack is also applicable where “Alice” is not 
the legitimate card holder, but a thief who has stolen the 
card and observed the PIN. To frustrate legal investiga- 
tion and fraud detection measures, criminals commonly 
use cards in a different country from where they were 
stolen. Magnetic stripe cards are convenient to use in 
this way, as the data can be read and sent overseas, to 
be written on to counterfeit cards. However, chip cards 
cannot be fully duplicated, so the physical card would 
need to be mailed, introducing a time window where the 
cardholder may report the card stolen or lost. 

The relay attack can allow fraudsters to avoid this de- 
lay by making the card available online using a card 
reader and a computer connected to the Internet. The 
fraudster’s accomplice in another country could connect 
to the card remotely and place transactions with a counterfeit
one locally. The timing constraints in this scenario
are more relaxed as there is no customer expecting
to remove their genuine card. Finally, in certain types
of transactions, primarily with unattended terminals, the
PIN may not be required, making this attack easier still.

APACS, the UK payment association, say they are un- 
aware of any cases of relay attacks being used against 
Chip & PIN in the UK [5]. The reason, we believe, is 
that even though the cost and the technical expertise that 
are required for implementing the attack are relatively 
low, there are easier ways to defeat the system. Methods 
such as card counterfeiting/theft, mail interception, and 
cardholder impersonation are routinely reported and are 
more flexible in deployment. 

These security holes are gradually being closed, but 
card fraud remains a lucrative industry: in 2006, £428m
(≈ $850m) of fraud was suffered by UK banks [6]. Criminals
will adapt to the new environment and, to maintain
their income, will likely resort to more technically de- 
manding methods, so now is the time to consider how to 
prevent relay attacks for when that time arrives. 


4 Defences 


The previous section described how feasible it is to de- 
ploy relay attacks against Chip & PIN and other smart- 
card based authorization systems in practice. Thus, sys- 
tem designers must develop mitigation techniques while, 
for economic consideration, staying within the deployed 
EMV framework as much as possible. 


4.1 Non-solutions 


In this section we describe a number of solutions that are 
possible, or have been proposed, against our attack and 
assess their overall effectiveness. 


Tamper-resistant terminals A pre-requisite of our re- 
lay attack is that Alice will insert her card and enter her 
PIN into a terminal that relays these details to the re- 
mote attacker. The terminal, therefore, must either be 
tampered with or be completely counterfeit, but still ac- 
ceptable to cardholders. This implies a potential solution 
— allow the cardholder to detect malicious terminals so 
they will refuse to use them. Unfortunately, this cannot 
be reliably done in practice. 

Although terminals do implement internal tamper- 
responsive measures, when triggered, they only delete 
keys and other data without leaving visible evidence to 
the cardholder. Tamper-resistant seals could be inspected 
by customers, but Johnston et al. [21] have shown that 
many types of seals can be trivially bypassed. It would 
also be infeasible to give all customers adequate training 
to detect tampering or counterfeiting of seals. By inducing
time-pressure and an awkward physical placement of
the terminal, the attacker can make it extremely difficult
for even a diligent customer to check for tampering.

Even if it was possible to produce an effective seal, 
there are, as of May 2007, 304 VISA approved terminal 
designs from 88 vendors [24], so cardholders cannot be 
expected to identify them all. Were there only one termi- 
nal design, the use of counterfeit terminals would have to 
be prevented, which raises the same problems as tamper- 
resistant seals. Finally, with the large sums of money 
netted by criminals from card fraud, fabricating plastic 
parts is well within their budget. 


Imposing additional timing constraints While relay 
attacks will induce extra delays between commands be- 
ing sent by the terminal and responses being received, 
existing smartcard systems are tolerant to very high la- 
tencies. We have successfully tested our relay device 
after introducing a three second delay into transactions, 
in addition to the inherent delay of our design. This 
extra round-trip time could be exploited by an attacker 
450 000 km away, assuming that signals propagate at the
speed of light. Perhaps, then, attacks could be prevented 
by requiring that cards reply to commands precisely af- 
ter a fixed delay. Terminals could then confirm that a 
card responds to commands promptly and will otherwise 
reject a transaction. 
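
The 450 000 km figure follows from halving the extra round-trip time and multiplying by the signal speed. A quick check, assuming the rounded value c ≈ 3 × 10^8 m/s (the function name is illustrative):

```python
SPEED_OF_LIGHT_M_PER_S = 3.0e8  # rounded value, assumed here


def max_relay_distance_km(extra_round_trip_s: float) -> float:
    """Distance an attacker could be away if all extra latency were propagation."""
    one_way_s = extra_round_trip_s / 2
    return one_way_s * SPEED_OF_LIGHT_M_PER_S / 1000


print(max_relay_distance_km(3.0))  # 450000.0
```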

Other than the generate AC command, which includes 
a terminal nonce, the terminal’s behaviour is very pre- 
dictable. So an attacker could preemptively request these 
details from the genuine card then send them to the coun- 
terfeit card where they are buffered for quick response. 
Thus, the value of latency as a distance measure can only 
be exploited at the generate AC stages. Furthermore, 
Clulow et al. [12] show how wireless distance bounding 
protocols, based on channels which were not designed 
for the purpose, can be circumvented. Their comments 
apply equally well to wired protocols such as ISO 7816. 

To hide the latency introduced by mounting the re- 
lay attack, the attacker aims to sample signals early and 
send signals late, while still maintaining their accuracy. 
In ISO 7816, cards and terminals are required to sample 
the signal between the 20% and 80% portion of the bit- 
time and aim to sample at the 50% point. However, an 
attacker with sensitive equipment could sample near the 
beginning, and send their bit late. The attacker then gains 
50% of a bit-width in both directions, which at a 5 MHz 
clock is 37 µs, or 11 km.

The attacker could also over-clock the genuine card so
the responses are returned more quickly. A DES calculation
could take around 100 ms, so only a 1% increase
would give a 300 km distance advantage. Even if the
calculation time was fixed, and only receiving the response
from the card could be accelerated, the counterfeit
card could preemptively reply with the predictable
11 bytes (2 byte response code, 5 byte read more command,
2 byte header and 2 byte counter), each taking 12
bit-widths (start, 8 data bits, stop and 2 bits guard time).
At 5 MHz + 1% this gives the attacker 98 µs, i.e. 29 km.

One EMV-specific problem is that the contents of the 
payload in the generate AC command are specified by 
the card in the card risk management data object list 
(CDOL). Although the terminal nonce should be at the 
end of the message in order to achieve maximum resis- 
tance to relay attacks, if the CDOL is not signed, the at- 
tacker could substitute the CDOL for one requesting the 
challenge near the beginning. Upon receiving the chal- 
lenge from the terminal, the attacker can then send this 
to the genuine card. Other than the nonce, the rest of the 
generate AC payload is predictable, so the counterfeit 
terminal can restore the challenge to the correct place, 
fill in the other fields and send it to the genuine card. 
Thus, the genuine card will send the correct response, 
even before the terminal thinks it has finished sending the 
command. A payload will be roughly 30 bytes, which at 
5 MHz gives 27 ms and an 8 035 km distance advantage.
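
The distance figures in the preceding paragraphs can be reproduced with a few lines of arithmetic. The sketch below assumes the ISO 7816 default of 372 clock cycles per bit (a value not stated above), the rounded value c ≈ 3 × 10^8 m/s, and that the advantage is simply the time gained multiplied by c; a 1% over-clock is treated as saving roughly 1% of the time. All names are illustrative.

```python
C_KM_PER_S = 3.0e5          # speed of light, rounded (assumption)
CYCLES_PER_BIT = 372        # ISO 7816 default bit duration in clock cycles (assumption)
CLOCK_HZ = 5.0e6            # fastest card clock considered in the text
BITS_PER_BYTE_ON_WIRE = 12  # start + 8 data bits + stop + 2 bits guard time

bit_time_s = CYCLES_PER_BIT / CLOCK_HZ  # 74.4 microseconds per bit


def advantage_km(time_gained_s: float) -> float:
    """Distance light travels in the time the attacker has gained."""
    return time_gained_s * C_KM_PER_S


# Sampling early and sending late gains half a bit-width.
half_bit_s = 0.5 * bit_time_s
# Over-clocking by 1% shortens a 100 ms DES computation by about 1 ms.
des_gain_s = 0.01 * 0.100
# Over-clocking by 1% while the card sends 11 predictable bytes.
reply_gain_s = 0.01 * 11 * BITS_PER_BYTE_ON_WIRE * bit_time_s
# Reordering the CDOL hides the transfer time of a 30-byte payload.
payload_s = 30 * BITS_PER_BYTE_ON_WIRE * bit_time_s

for label, t in [
    ("half bit", half_bit_s),
    ("DES over-clock", des_gain_s),
    ("11-byte reply", reply_gain_s),
    ("30-byte payload", payload_s),
]:
    print(f"{label}: {t * 1e6:.0f} us, {advantage_km(t):.0f} km")
```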

Nevertheless, eliminating needless tolerance to re- 
sponse latency would decrease the options available to 
the attacker. If it were possible to roll out this modifica- 
tion to terminals as a software upgrade, it might be ex- 
pedient to plan for this alteration to be quickly deployed 
in reaction to actual use of the relay attack. While we 
have described how this countermeasure could be cir- 
cumvented, attackers who build and test their system 
with high latency would be forced to re-architect it if the 
acceptable latency of deployed terminals were decreased 
without warning. 


4.2 Procedural improvements 


Today, merchants and till operators are accustomed to 
looking away while customers enter their PIN and sel- 
dom handle the card at all, while customers are often rec- 
ommended not to allow anyone but themselves to handle 
the card because of card skimming. In the case of relay 
attacks, this assists the criminal, not the honest customer 
or merchant. If the merchant examined the card, even 
superficially, he would detect the relay attack, as we im- 
plemented it, by spotting the wires. That said, it is not in- 
feasible that an RFID proximity card could be modified 
to relay data wirelessly to a local receiver and therefore 
appear to be a genuine one. 

A stronger level of protection can be achieved if, af- 
ter the transaction is complete, the merchant checks not 
only that the card presented is legitimate, but also that the 
embossed card number matches the one on the receipt. 
In the case of the relay attack, the receipt will show the 
victim’s card number, whereas the counterfeit card will 
show the original number of the card from before it was
tampered with. For these to match, the fraudster must have
appropriate blank cards and an embossing machine, in 
addition to knowing the victim’s card number in advance. 

Alternatively, a close to real-time attack could still be 
executed with a portable embossing machine. Existing 
devices take only a few seconds to print a card and it 
is feasible that fraudsters can make them portable. The 
quality of counterfeit cards and embossing need not be 
high, just sufficient to pass a cursory inspection. More 
recent smartcards are being issued without embossing, 
as the carbon-paper payment method is no longer used, 
making counterfeits even easier to produce. If none of 
these possibilities are open to the fraudster, repeat customers
could be targeted, thus creating a wide window
of opportunity. In some scenarios, such as unattended
Chip & PIN terminals, ATMs, or where the terminal is 
on the opposite side of a glass barrier, physical card in- 
spections would not be possible; but even where it is, the 
merchant must be diligent. 

Varian [23] argues that if the party who is in the best 
position to prevent fraud does not have adequate incen- 
tives to do so, security suffers. If customers must de- 
pend on merchants, who they have no relationship with, 
for their protection, then there are mismatched incen- 
tives. Merchants selling low-marginal-cost products or 
services (e.g. software or multimedia content), have little 
desire to carefully check for relay attacks. This is be- 
cause, in the case of fraud, costs will likely be borne by 
the customer. Even if the transaction is subsequently re- 
versed when fraud is detected, the merchant has lost only 
the low marginal cost and the chargeback overhead, but 
has saved the effort of checking cards. 


4.3 Hardware alterations


The electronic attorney is a trusted device that the customer
brings into the transaction so that the merchant’s
terminal does not need to be trusted. This is called
the “man-in-the-middle defence”, as suggested by Anderson
and Bond [2]; trusted devices to protect customers
are also discussed by Asokan et al. [7]. The device is inserted
into the terminal’s card slot while the customer
inserts their card into the device. The device can display 
the transaction value as it is parsed from the data sent 
from the terminal, allowing the customer to verify that 
she is charged the expected amount. If the customer ap- 
proves the transaction, she presses a button on the elec- 
tronic attorney itself, which allows the protocol to pro- 
ceed. This trusted user interface is necessary, since if a 
PIN was used as normal, a fraudster could place a legiti- 
mate transaction first, which is accepted by the customer, 
but with knowledge of the PIN a subsequent fraudulent 
one can be placed. Alternatively, one-time-PINs could 
be used, but at a cost in usability.

Because the cardholder controls the electronic attorney,
and it protects the cardholder’s interests, the incentives
are properly aligned. Market forces in the business
of producing and selling these devices should encourage
security improvements. However, this extra device will
increase costs, increase complexity and may not be approved
of by banking organizations. Additionally, fraudsters
may attempt to discourage their use, either explicitly
or by arranging the card slot so the use of an electronic
attorney is difficult. A variant of the trusted user interface
is to integrate a display into the card itself [8].

Another realization of the trusted user interface for 
payment applications is to integrate the functionality of 
a smartcard into the customer’s mobile phone. This can 
allow communication with the merchant’s terminal using 
near field communications (NFC) [20]. This approach 
is already under development and has the advantage of
being a customer-controlled device with a large screen
and convenient keypad, allowing the merchant’s name
and transaction value to be shown and, once the user has
authorized the transaction, the PIN to be entered. Wireless
communications also reduce the risk of a malicious merchant
arranging the terminal so that the trusted display device is not visible.
Although mobile phones are affordable and ubiquitous, 
they may still not be secure enough for payment applica- 
tions as they can be, for example, targeted by malware. 


5 Distance bounding 


None of the techniques detailed in Section 4.1 are ad- 
equate to completely defeat relay attacks. They are ei- 
ther impractical (tamper-resistant terminals), expensive 
(adding extra hardware) or circumventable (introduc- 
ing tighter timing constraints and requiring merchants to 
check card numbers). Due to the lack of a customer- 
trusted user interface on the card, there is no way to de- 
tect a mismatch between the data displayed on the termi- 
nal and the data authorized by the card. However, relay 
attacks can be foiled if either party can securely establish 
the position of the card which is authorizing the transac- 
tion, relative to the terminal processing it. 

Absolute positioning is infeasible due to the cost and 
form factor requirements of smartcards being incompat- 
ible with GPS, and also because the civilian version is 
not resistant to spoofing [22]. However, it is possible 
for the terminal to securely establish a maximum dis- 
tance bound, by measuring the round-trip-time between 
it and the smartcard; if this time is too long, an alarm 
would be triggered and the transaction refused. De- 
spite the check being performed at the merchant end, 
the incentive-compatibility problem is lessened because 
the distance verification is performed by the terminal and 
does not depend on the sales assistant being diligent. 


The approach of preventing relay attacks by mea- 
suring round-trip-time was first proposed by Beth and 
Desmedt [9] but Brands and Chaum [11] described the 
first concrete protocol. The cryptographic exchange in 
our proposal is based on the Hancke-Kuhn protocol [17], 
because it requires fewer steps, and it is more efficient 
if there are transmission bit errors compared to Brands- 
Chaum. However, the Hancke-Kuhn protocol is pro- 
posed for ultra-wideband radio (UWB), whereas we re- 
quire synchronous half-duplex wired transmission. 

One characteristic of distance-bounding protocols, un- 
like most others, is that the physical transmission layer 
is security-critical and tightly bound to the other layers, 
so care must be taken when changing the transmission 
medium. Wired transmission introduces some differ- 
ences, which must be taken into consideration. Firstly, 
to avoid circuitry damage or signal corruption, in a wired 
half duplex transmission, contention (both sides driving 
the I/O at the same time) must be avoided. Secondly, 
whereas UWB only permits the transmission of a pulse, 
wired allows a signal level to be maintained for an ex- 
tended period of time. Hence, we may skip the initial 
distance-estimation stage of the Hancke-Kuhn setup and 
simplify our implementation. 

While in this section we will describe our implemen- 
tation in terms of EMV, implemented to be compatible 
with ISO 7816, it should be applicable to any wired, half- 
duplex synchronous serial communication line. 


5.1 Protocol 


In EMV, authentication is only card to terminal so we 
follow this practice. Following the Hancke-Kuhn terminology,
the smartcard is the prover, P, and the terminal is the
verifier, V. This is also appropriate because the Hancke- 
Kuhn protocol puts more complexity in the verifier than 
the prover, and terminals are several orders of magnitude 
more expensive and capable than the cards. The protocol 
is described as follows: 


Initialization:
    V → P: $N_V \in \{0,1\}^a$
    P → V: $N_P \in \{0,1\}^a$
    P: $(R_i^0 \| R_i^1) = H_K(N_V, N_P) \in \{0,1\}^{2b}$
Rapid bit-exchange:
    V → P: $C_i \in \{0,1\}$
    P → V: $R_i^{C_i} \in \{0,1\}$

At the start of the initialization phase, nonces and pa- 
rameters are exchanged over a reliable data channel, with 
timing not being critical. $N_V$ and $N_P$ provide freshness
to the transaction in order to prevent replay attacks,
with the latter preventing a middle-man from running the
complete protocol twice between the two phases using
the same $N_V$ and complementary $C_i$ and thus obtaining
both $R_i^0$ and $R_i^1$. The prover produces a MAC under its
key, $K$, using a keyed pseudo-random function, the result
of which is split into two shift registers, $R_i^0$ and $R_i^1$.


              A    3    8    F    6    D    7    5
C_i         : 1010 0011 1000 1111 0110 1101 0111 0101
R_i^0       : x0x0 11xx x011 xxxx 0xx1 xx1x 1xxx 1x0x
R_i^1       : 1x0x xx10 1xxx 0001 x10x 01x0 x111 x1x0
R_i^{C_i}   : 1000 1110 1011 0001 0101 0110 1111 1100
              8    E    B    1    5    6    F    C

Table 1: Example of the rapid bit-exchange phase of the
distance bounding protocol. For clarity, x is shown instead
of the response bits not sent by the prover. The leftmost
bit is sent first.

In the timing-critical rapid bit-exchange phase, the
maximum distance between the two participants is determined.
V sends a cryptographically secure pseudorandom
single-bit challenge $C_i$ to P, which in turn immediately
responds with $R_i^{C_i}$, the next single-bit response,
from the corresponding shift register. A transaction of a
32 bit exchange is shown in Table 1.
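
The selection rule can be checked against the values in Table 1. In this sketch the registers are written as strings, with `x` standing for bits the prover never reveals; the function name `respond` is illustrative.

```python
def respond(challenges: str, r0: str, r1: str) -> str:
    """For each challenge bit C_i, answer with bit i of register R^0 or R^1."""
    return "".join(r1[i] if c == "1" else r0[i] for i, c in enumerate(challenges))


# Values from Table 1, leftmost bit first; 'x' marks register bits never sent.
C = "10100011100011110110110101110101"
R0 = "x0x011xxx011xxxx0xx1xx1x1xxx1x0x"
R1 = "1x0xxx101xxx0001x10x01x0x111x1x0"

response = respond(C, R0, R1)
print(f"{int(response, 2):08X}")  # 8EB156FC
```

Because each challenge bit selects exactly one register, half of the register contents on average are never disclosed in a given run.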

If a symmetric key is used, this will require an on-line
transaction to verify the result because the terminal does
not store $K$. If the card has a private/public key pair, a
session key can be established and the final challenge- 
response can also be verified offline. The values a and b, 
the nonce and shift register bit lengths, respectively, are 
security parameters that are set according to the applica- 
tion and are further discussed in Section 5.5. 
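
The initialization phase can be sketched as follows. The text does not say which keyed pseudo-random function is used; HMAC-SHA256 is substituted here purely for illustration, the nonce length of 32 bits is an arbitrary example value for a, and `derive_registers` is a hypothetical helper name. The register length b = 64 matches the shift registers described in Section 5.3.

```python
import hashlib
import hmac
import secrets


def derive_registers(key: bytes, n_v: bytes, n_p: bytes, b: int = 64):
    """Split a keyed PRF output over both nonces into two b-bit registers."""
    if 2 * b > 256:
        raise ValueError("HMAC-SHA256 yields only 256 bits")
    digest = hmac.new(key, n_v + n_p, hashlib.sha256).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)[: 2 * b]
    return bits[:b], bits[b:]


a = 32                             # nonce length in bits (illustrative)
key = secrets.token_bytes(16)      # stands in for the card's key K
n_v = secrets.token_bytes(a // 8)  # verifier nonce N_V
n_p = secrets.token_bytes(a // 8)  # prover nonce N_P
r0, r1 = derive_registers(key, n_v, n_p)
print(len(r0), len(r1))
```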

This exchange succeeds in measuring distance because
it necessitates that a response bit arrive at a certain
time after the challenge has been sent. When the protocol
execution is complete, V’s response register, $R_i^{C_i}$, is
verified by the terminal or bank to determine if the prover
is within the allowed distance for the transaction.


5.2 Implementation 


ISO 7816, our target application, dictates that the smart- 
card (prover) is a low resource device, and therefore, 
should have minimal additions in order to keep costs 
down; this was our prime constraint. The terminal (ver- 
ifier), on the other hand, is a capable, expensive de- 
vice that can accommodate moderate changes and addi- 
tions without adversely affecting its cost. Of course, the 
scheme must be secure against all attacks devised by a highly
capable adversary that can relay signals at the speed of 
light, is able to ensure perfect signal integrity, and can 
clock the smartcard at higher frequencies than it was designed
for. We assume, however, that this attacker does
not have access to the internal operation of the terminal
and that extracting secret material out of the smartcard,
or interfering with its security critical functionality, is not
economical considering the returns from the fraud.

Figure 3: Waveforms of a single bit-exchange of the
distance bounding protocol. $f_V$ is the verifier’s clock;
DRV_C drives the challenge on to I/O; SMPL_R samples
the response; CLK_{V→P} is the prover’s clock; I/O_V and
I/O_P are versions of the I/O on each side accounting for
the propagation delay $t_d$; SMPL_C is the received clock
that is used to sample the challenge; and DRV_R drives
the response on to the I/O.


5.3 Circuit elements and signals 


For this section refer to Table 2 for signal names and their 
function, Figure 4 for the circuit diagram and Figure 3 for 
the signal waveforms. 


Clocks and frequencies As opposed to the prover, the
verifier may operate at high frequencies. We have implemented
the protocol such that one clock cycle of the
verifier’s operating frequency, $f_V$, determines the distance
resolution. Since signals cannot travel faster than
the speed of light, $c$, the upper-bound distance resolution
is therefore $c/f_V$. Thus, $f_V$ should be chosen to be as
high as possible. We selected 200 MHz, which allows
us a 1.5 m resolution under ideal conditions for the attacker.
We have made the prover’s operating frequency,
$f_P$, compatible with any frequencies having a high-time
greater than f, + fv’ + ta, where tq defines the time 
between when the challenge is being driven onto the I/O
and when the response is sampled by the verifier; t_d is
the delay between V and P. ISO 7816 specifies that
the smartcard/prover needs to operate at 1-5 MHz and,
in order to be compatible, we chose f_P = f_V/128 =
1.56 MHz for our implementation.
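
The arithmetic behind these two choices can be checked with a short Python sketch. It is only an illustration: the function names are hypothetical, and the divider of 128 and the 1-5 MHz range are the values quoted in the text.

```python
# Sketch: distance resolution and prover clock implied by the verifier clock.
# Function names are hypothetical; the numbers come from the surrounding text.

C = 299_792_458.0  # speed of light in vacuum, m/s


def distance_resolution(f_v_hz: float) -> float:
    """Upper-bound distance resolution c / f_V, in metres."""
    return C / f_v_hz


def prover_frequency(f_v_hz: float, divider: int = 128) -> float:
    """Prover clock obtained by dividing the verifier clock."""
    return f_v_hz / divider


f_v = 200e6                                # verifier clock, 200 MHz
resolution_m = distance_resolution(f_v)    # about 1.5 m
f_p = prover_frequency(f_v)                # 1.5625 MHz
iso_ok = 1e6 <= f_p <= 5e6                 # inside the 1-5 MHz range

print(f"resolution ~ {resolution_m:.2f} m, "
      f"f_P ~ {f_p / 1e6:.2f} MHz, within 1-5 MHz: {iso_ok}")
```

Doubling f_V would halve the resolution figure, which is why the text recommends choosing f_V as high as possible.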


Shift registers The design has four 64-bit shift registers
(SR): the verifier’s challenge and received response
SRs and the prover’s two response SRs. The challenge
SR is clocked by CLK_V and is shifted one clock cycle
before it is driven on to the I/O line by DRV_C. The verifier’s
response SR is also clocked by CLK_V and is shifted
on the rising edge of SMPL_R. On the prover side, the
SRs are clocked and shifted by SMPL_C.

Signals & timing
parameters        Description
CLK_V, f_V        Verifier’s clock and frequency; determines the distance resolution
CLK_{V→P}, f_P    Prover’s clock and frequency; received from verifier
DRV_C             While asserted the challenge is transmitted
t_n               Length of time verifier drives the challenge on to the I/O
SMPL_C            Prover samples challenge on rising edge
t_m               Length of time between assertion of DRV_C to assertion of CLK_{V→P}
DRV_R             Prover transmits response
t_p               Amount of delay applied to SMPL_C
SMPL_R            Verifier samples response on rising edge
t_q               Time from assertion of CLK_{V→P} to rising edge of SMPL_R; determines
                  upper bound of prover’s distance
t_d               Propagation delay through distance d
Table 2: Signals and their associated timing parameters.

Figure 4: Simplified diagram of the distance bounding circuit. DRV_C controls when the challenge is put on the I/O
line. CLK_V controls the verifier’s circuit; it is divided and is received as SMPL_C at the prover where it is used to
sample the challenge. A delay element produces DRV_R, which controls when the response is put on the I/O, while at the
verifier SMPL_R samples it. The pull-up resistor R is present to pull the I/O line to a stable state when it is not actively
driven by either side.


Bi-directional I/O The verifier and prover communicate
using a bi-directional I/O with tri-state buffers at
each end. These buffers are controlled by the signals
DRV_C and DRV_R, and are implemented such that only
one side drives the I/O line at any given time in order to
prevent contention. This is a consequence of adapting
the Hancke-Kuhn protocol to a wired medium, and implies
that the duration of the challenge must be no longer
than necessary, so as to obtain the most accurate distance
bound. A pull-up is also present, as with the ISO 7816
specification, to maintain a high state when the line is
not driven by either side. As a side note, if the constraints
imposed by ISO 7816 need not be adhered to,
two uni-directional wires for the challenge and response
could be used for easier implementation.


5.4 Timing 


A timing diagram of a single challenge-response ex- 
change is shown in Figure 3. The circuit shown in Fig- 
ure 4 was implemented on an FPGA using Verilog (not 
all peripheral control signals are shown for the sake of 
clarity). Since we used a single chip, the I/O and clock 
lines were “looped-back” using various length transmis- 
sion wires to simulate the distance between the verifier 
and prover as shown in Figure 5. 

The first operation is clocking the challenge shift register
(not shown), which is driven on to the I/O line by
DRV_C on the following f_V clock cycle for a t_n period.
t_n should be made long enough to ensure that the prover
can adequately and reliably sample the challenge, and
as short as possible to allow the response to be rapidly
sent while not causing contention. The clock sent to P,
CLK_{V→P}, is asserted t_m after the rising edge of DRV_C.
Both CLK_{V→P} and the I/O line have the same propagation
delay, t_d, and when the clock edge arrives (now
called SMPL_C), it samples the challenge. The same
clock edge also shifts the two response registers, one of
which is chosen by a 2:1 multiplexer that is controlled
by the sampled challenge. DRV_R is a delayed replica of
SMPL_C, which is created using a delay element.

Figure 5: The Xilinx XUP board with a VirtexII-PRO 30
FPGA on which the distance bounding design was implemented.
Both verifier and prover reside on the same chip
connected only by two same-length transmission lines for
I/O and clock (1 m shielded cables are shown).

The delay, t_p, allows the response SR signals to shift
and propagate through the multiplexer, preventing the intermediate
state of the multiplexer from being leaked.
Otherwise, the attacker could discover both responses to
the previous challenge in the case where C_i ≠ C_{i−1}.
t_p may be very short but should be at least as long as
the period from the rising edge of SMPL_C to when the
response emerges from the multiplexer’s output; in our
implementation, we used deliberately placed routing delays
to adjust t_p, which can be as short as 500 ps. When
DRV_R is asserted, the response is driven on to the
I/O line until the falling edge.

At the verifier, the response is sampled by SMPL_R
after t_q from the assertion of CLK_{V→P}. The value of
t_q determines the distance measured and should be long
enough to account for the propagation delay that the system
was designed for (including on-chip and package delays),
and short enough to not allow an attacker to be further
away than desired, with the minimum value being
t_p + 2t_d. As an improvement, t_q can be dynamically adjusted
between invocations of the protocol allowing the
verifier to make decisions based on the measured distance,
for example, determine the maximum transaction
amount allowed. With a single iteration, the verifier can
discover the prover’s maximum distance away, but with
multiple iterations, the exact distance can be found with
a margin of error equal to the signal propagation time
during a single clock cycle of the verifier. SMPL_R may
be made to sample on both rising and falling edges of
f_V, effectively doubling the distance resolution without
increasing the frequency of operation (other signals may
operate this way for tighter timing margins).
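
The bound t_q ≥ t_p + 2t_d can be explored numerically. The Python sketch below is an illustration under stated assumptions rather than the prototype’s logic: it ignores the on-chip and package delays that must be included in practice, it assumes t_q is set in whole periods of the verifier clock, and it uses illustrative values (t_p = 500 ps, t_d = 2.16 ns, f_V = 200 MHz). All function names are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s


def min_t_q(t_p: float, t_d: float) -> float:
    """Smallest usable t_q: prover delay plus the round-trip propagation delay."""
    return t_p + 2.0 * t_d


def quantized_t_q(t_p: float, t_d: float, f_v: float) -> float:
    """Round the minimum t_q up to a whole number of verifier clock periods.

    Assumption: t_q can only be set in steps of 1 / f_V.
    """
    cycles = math.ceil(min_t_q(t_p, t_d) * f_v - 1e-9)  # small guard for float error
    return cycles / f_v


def max_distance_at_c(t_q: float, t_p: float) -> float:
    """Largest one-way distance that still meets t_q if signals travel at c."""
    return C * (t_q - t_p) / 2.0


t_p, t_d, f_v = 0.5e-9, 2.16e-9, 200e6   # illustrative values only
t_q = quantized_t_q(t_p, t_d, f_v)       # one 5 ns clock period
bound_m = max_distance_at_c(t_q, t_p)    # slack left to an ideal attacker

print(f"min t_q = {min_t_q(t_p, t_d) * 1e9:.2f} ns, "
      f"quantized t_q = {t_q * 1e9:.1f} ns, bound = {bound_m:.2f} m")
```

The gap between the minimum and the quantized value is timing slack that an ideal attacker could spend on extra distance, which is one reason to keep the margins tight.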

If we assume that an attacker can transmit signals at 
the speed of light and ignore the real-life implications of 
sending them over long distances, we can determine the 
theoretical maximum distance between the verifier and 
prover. A more realistic attacker will need to overcome 
signal integrity issues that are inherent to any system. 
We should not, therefore, make it easy for the attacker
by designing with liberal timing constraints, and should choose
the distance between the verifier and prover, d, to be as
short as possible. More importantly, we should carefully
design the system to work for that particular distance
with very tight margins. For example, the various terminals
we have tested were able to transmit/drive a signal
through a two meter cable, although the card should at
most be a few centimeters away. Weak I/O drivers could
be used to degrade the signal when an extension is applied.
The value of d also determines most of the timing
parameters of the design, and as we shall see next, the 
smaller these are, the harder it will be for the attacker to 
gain an advantage. 


5.5 Possible attacks on distance bounding 


Although, following from our previous assumptions, the 
attacker cannot get access to any more than half the re- 
sponse bits, there are ways he may extend the distance 
limit before a terminal will detect the relay attack. This 
section discusses which options are available, and their 
effectiveness in evading defences. 


Guessing attack Following the initialization phase, 
the attacker can initiate the bit-exchange phase before 
the genuine terminal has done so. As the attacker does 
not know the challenge at this stage, he will, on average, 
guess 50% of the challenge bits correctly and so receive 
the correct response for those. For the ones where the 
challenge was guessed incorrectly, the response is effec- 
tively random, so there is still a 50% chance that the re- 
sponse will be correct. Therefore the expected success 
rate of this technique is 75%. 
Since our tests show a negligible error rate, the terminal
may reject any response with a single bit that is incorrect.
In our prototype, where the response registers are
64 bits each, the attacker will succeed with probability
(3/4)^64 ≈ 1 in 2^26. The size of the registers is a security
parameter that can be increased according to the application,
while the nonces assure that the attacker can only
guess once.
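
The success probability of the guessing attack can be reproduced with exact fractions. The following Python sketch only restates the argument above; the function names are hypothetical.

```python
import math
from fractions import Fraction


def per_bit_success() -> Fraction:
    """Chance that one response bit is right when the challenge is guessed.

    Half the time the guess is right and the response is correct; otherwise
    the response is effectively random and is right half the time.
    """
    guess_right = Fraction(1, 2)
    lucky_response = Fraction(1, 2)
    return guess_right + (1 - guess_right) * lucky_response


def attack_success(n_bits: int) -> Fraction:
    """Chance that all n_bits response bits are accepted by the terminal."""
    return per_bit_success() ** n_bits


p = attack_success(64)
security_bits = -math.log2(float(p))

print(f"per-bit success = {per_bit_success()}, "
      f"overall ~ 1 in 2^{security_bits:.1f}")
```

Because the exponent grows linearly with the register length, lengthening the registers is a direct way to lower the attacker’s chances.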


Replay If the attacker can force the card to perform 
two protocol runs, with the same nonces used for both, 
then all bits of the response can be extracted by sending 
all 1’s on the first iteration and all 0’s on the second. We 
resist this attack by selecting the protocol variant men- 
tioned by Hancke and Kuhn [17], where the card adds 
its own nonce. This is cheap to do within EMV since 
a transaction counter is already required by the rest of 
the protocol. If this is not desired then, provided the card
cannot be clocked at twice its intended frequency, the attacker
will not be able to extract all bits in time. This assumes
that the time between starting the distance bounding
protocol, and the earliest time the high-speed stage
can start, is greater than the latter’s duration.
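
The replay attack can be illustrated with a small Python model. It is a sketch under assumptions: the register contents are random stand-ins for the values derived during the initialization phase, the mapping of challenge bit 0 to the first register and 1 to the second is chosen for illustration, and all names are hypothetical.

```python
import secrets

N_BITS = 64


def random_bits(n: int = N_BITS) -> list[int]:
    """Stand-in for a register or a challenge sequence of n random bits."""
    return [secrets.randbits(1) for _ in range(n)]


def respond(reg0: list[int], reg1: list[int], challenge: list[int]) -> list[int]:
    """Per-bit multiplexer: challenge bit 0 selects reg0, challenge bit 1 selects reg1."""
    return [r1 if c else r0 for r0, r1, c in zip(reg0, reg1, challenge)]


# Same nonces in both runs means the card holds the same two registers twice.
reg0, reg1 = random_bits(), random_bits()

leak_ones = respond(reg0, reg1, [1] * N_BITS)   # first run: all 1's
leak_zeros = respond(reg0, reg1, [0] * N_BITS)  # second run: all 0's

# With both registers known, any later challenge can be answered without the card.
challenge = random_bits()
forged = [leak_ones[i] if c else leak_zeros[i] for i, c in enumerate(challenge)]

print("forged response accepted:", forged == respond(reg0, reg1, challenge))
```

If the card contributes its own nonce, the registers differ between runs, so bits leaked in one run say nothing about the next.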


Early bit detection and deferred bit signalling The
card will not sample the terminal’s challenge until t_{m+d}
after the challenge is placed on the I/O line. This is to allow
an inexpensive card to reliably detect the signal but,
as Clulow et al. [12] suggest, an attacker who is willing
to invest in expensive equipment could, in theory, detect
the signal immediately. By manipulating the clock provided
to the genuine card, and using high-quality signal
drivers, the challenge could be sent to the card with less
of a delay.

Similarly, the terminal will wait t_q between sending
the challenge and sampling the response, to allow for the
round trip signal propagation time, and wait until the response
signal has stabilized. Again, with superior equipment
the response could be sent from the card just before
the terminal samples. The attacker, however, cannot do
so any earlier than t_p after the card has sampled the challenge,
and the response appears on the I/O.


Delay-line manipulation The card may include the
value of t_p in its signed data, so the attacker cannot make
the terminal believe that the value is larger than the card’s
specification. However, the attacker might be able to reduce
the delay, for example by cooling the card. If it can
be reduced to the point that the multiplexer or latch has
not settled, then both potential responses may be placed
on to the I/O line, violating our assumptions.

However, if the circuit is arranged so that the delay
will be reduced only if the reaction of the challenge latch
and multiplexer is improved accordingly, the response
will still be sent out prematurely. This gives the attacker
extra time, so should be prevented. If temperature compensated
delay lines are not economic, then they should
be as short as possible to reduce this effect.

In fact, t_p may be so small, even less than 1 ns, that the
terminal could just assume it to be zero. This will
mean that the terminal will believe all cards are slightly
further away than they really are, but will avoid the value
of t_p having to be included in the signed data.


Combined attacks For an attacker to gain a better than
1 in 2^26 probability of succeeding in the challenge-response
protocol, the relay attack must take less than
t_{m+q} time. In practice, an attacker will not be able
to sample or drive the I/O line instantaneously and the
radio-link transceiver or long wires will introduce latency,
so the attacker would need to be much closer than
this limit. A production implementation on an ASIC
would be able to give better security guarantees and be
designed to tighter specifications than were available on
the FPGA for our prototype.


5.6 Results 


We have developed a versatile implementation that requires
only modest modification to currently deployed
designs. Our distance bounding scheme was successfully
implemented and tested on an FPGA for 2.0, 1.0,
and 0.3 meter transmission lengths, although it can be
modified to work for any distance and tailored to any end
application. Oscilloscope traces of a single bit challenge-response
exchange over a 50 Ω, 30 cm printed circuit
board transmission line are shown in Figure 6. In this
case, the challenge is 1 and the response is 0 with indicators
where SMPL_R has sampled the response. The first,
after t_{q,fail} = 15 ns, has sampled too early while the second,
t_{q,pass} = 20 ns, which is a single period of f_V later,
has correctly sampled the 0 as the response. The delay
t_d = 2.16 ns can also be seen and is, of course, due to
the length of the transmission line. If the attacker exploited
all possible attacks previously discussed and was
able to transmit signals at c, he would need to be within
approximately 6 m, although the actual distance would
be shorter for a realistic attacker.
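
Two of the figures quoted above can be cross-checked with simple arithmetic: the propagation velocity implied by a 2.16 ns delay over a 30 cm line, and the one-clock-period gap between the failing and passing sample points. The Python sketch below is only a consistency check; the variable names are hypothetical.

```python
C = 299_792_458.0          # speed of light in vacuum, m/s

line_length_m = 0.30       # 30 cm transmission line
t_d = 2.16e-9              # measured one-way propagation delay
f_v = 200e6                # verifier clock
t_q_fail, t_q_pass = 15e-9, 20e-9

velocity = line_length_m / t_d    # propagation velocity on the line
fraction_of_c = velocity / C      # how far below c the line operates
period = 1.0 / f_v                # one verifier clock period

print(f"velocity ~ {velocity:.2e} m/s ({fraction_of_c:.0%} of c), "
      f"sample step = {(t_q_pass - t_q_fail) * 1e9:.0f} ns, "
      f"clock period = {period * 1e9:.0f} ns")
```

Signals on the line travel at under half the speed of light, so an attacker relaying at c gains distance relative to the cable length for which the system was designed.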




5.7 Costs 


The FPGA design of both the verifier and prover as
shown in Figure 5 consumes 37 flip-flops and 93 look-up
tables: 64 for logic, 13 route-throughs, and 16 as
shift registers (4 cascaded 16-bit LUTs for each), which
is extremely compact, and consumes well under 0.5%
of the resources available on our FPGA. However, it is
difficult to estimate the cost of an ASIC implementation
with these figures as there is no reliable conversion technique
between FPGA resource utilization and ASIC transistor
count, especially since the above numbers are for
the core functions, without the supporting circuitry. It is
also hard to estimate the cost in currency because that
changes rapidly with time, production volume, fabrication
process, and many other factors, so we will describe
it relative to the resources currently used.

(b) Single bit exchange, challenge is 1 and response is 0.

Figure 6: Oscilloscope trace from the bit-exchange phase of the distance bounding protocol. Delay is introduced by a
30 cm transmission line between the verifier and prover. Timing parameters are t_n = 10 ns, t_m = 5 ns, t_p = 8 ns. Two
values of t_q are shown, one where the bit was correctly received, t_{q,pass} = 20 ns, and one where it was not, t_{q,fail} = 15 ns.
t_d was measured to be 2.16 ns, which over a 30 cm wire corresponds to a propagation velocity of 1.4 x 10^8 m/s. Note
that before the challenge is sent, the trace is slowly rising above ground level; this is the effect of the pull-up resistor
as also seen in (a) after the protocol completes. The shown signals were probed at the FPGA I/Os and do not precisely
represent when they actually appear inside of it. For example, the FPGA I/O introduces 3-5 ns delay to the signal so
in actuality the FPGA will “see” the falling edge shown in (b) slightly after what is represented in the figure. On-chip
delay also affects the design and is not shown, but must be accounted for.

As mentioned, we have made every effort to minimize 
the circuitry that needs to be added to the smartcard while 
being more liberal with the terminal, although for both 
the additions can be considered minor. For the smartcard, 
new commands for initiating the initialization phase need 
to be added as well as two shift registers and a state ma- 
chine for operating the rapid bit-exchange. Consider- 
ing that smartcards already have a few thousand memory 
cells, this can be considered a minor addition, especially 
given that they need to operate at the existing low fre- 
quencies of 1-5 MHz. For the initialization phase, exist- 
ing circuits can be used such as the DES engine for pro- 
ducing the content of the response registers. The card’s 
transaction counter may be used for the nonce, N,. 

As for the terminals, their internal operating frequency 
is unknown to us, but it is unlikely that it is high enough 
to achieve good distance resolution. Therefore, a capable 
processor and some additional components are required, 
such as a high quality oscillator. As an alternative to high 
frequencies, or when designing for very short distances, 
delay lines could be used instead of operating on clock 
edges. The distance bounding circuitry, which consists
of two shift registers and slightly more involved control
code than the smartcard’s, would need to be added to the
terminal’s main processor.

We have described the added cost in terms of hard- 
ware but the added time per transaction and the need to 
communicate with the bank, refused transactions due to 
failure, re-issuing cards, and so on, may amount to sub- 
stantial costs. Only the banks involved have access to all 
the necessary information needed to make a reasonable 
estimate of these overheads. 


6 Discussion 


The distance bounding protocol we have proposed will 
detect attempted relay attacks but requires the banks to 
produce cards and terminals that support the protocol. 
However, the person being defrauded is the cardholder, 
Alice, who must use the cards and terminals she is given, 
so has no trusted user interface and no way to protect 
herself. As mentioned in Section 4.2, this incentive 
mismatch may be detrimental to the cardholder’s security.
For instance, until all terminals support the distance
bounding protocol, the issuer can select whether to fall
back to the current protocol that is vulnerable to relay
attacks. Under existing UK practice, the customer is liable
for PIN-verified fraudulent transactions [10], so the
issuer may elect to accept fallback transactions knowing 
that the cardholder is carrying the risk. 

A further problem of the distance bounding protocol 
is the lack of non-repudiation: for a third party to ver- 
ify that a relay attack was not in progress, the merchant’s 
terminal must be trusted to correctly report the round-trip 
latency. Thus, if a customer claims that a transaction is 
fraudulent, then even if the distance bounding protocol is 
recorded to have succeeded, there remains the possibil- 
ity that the terminal has been tampered with. It falls on 
the acquirer to mandate tamper-resistant terminals, but 
although the payment network may require that all mem- 
bers implement appropriate protections, the customer is 
only represented indirectly by the issuer. 

So while inexpensive yet strong technical solutions,
such as distance bounding, do exist, they must be deployed
as part of an appropriate liability framework to
fully realize their benefits. The current situation, where
customers are liable for fraud, yet powerless to verify 
whether a terminal is genuine, is clearly unfair. If the 
power of banking institutions is too great to alter the 
entrenched notion of customer liability, then measures 
that put the cardholder in a position of control, such as 
the electronic attorney [2], despite being more expen- 
sive, may be the most appropriate solution. However, 
customers should exercise caution before accepting these 
options as the second-order effect of the customer being 
able to detect attacks could be to make them reasonably 
liable for any fraud which is nevertheless perpetrated. 


7 Conclusion 


This paper described relay attacks and how they can be 
applied to exploit smartcard-based payment systems. A 
prototype was built and shown to be successful against 
the Chip & PIN payment system deployed in the UK. 
This consisted of creating a fake terminal and custom 
hardware to allow the relaying of information between 
the participating parties. We suggested procedural im- 
provements to the acceptance of Chip & PIN transac- 
tions, which would provide a short-term defence, but 
these could be circumvented by a plausible attacker. We 
then developed the first implementation of a distance 
bounding defence against these relay attacks and showed 
it to be the most robust solution. Our implementation 
was designed to be appealing for adoption in the next 
generation of smartcards by tailoring the design to the 
EMV framework. 

Future work may include implementing a wireless 
variant of the protocol, mutual distance bound establish- 
ment and customizing the system to other applications. 
Abstract


Although motivated by both usability and security con- 
cerns, the existing literature on click-based graphical 
password schemes using a single background image 
(e.g., PassPoints) has focused largely on usability. We 
examine the security of such schemes, including the im- 
pact of different background images, and strategies for 
guessing user passwords. We report on both short- and 
long-term user studies: one lab-controlled, involving 43 
users and 17 diverse images, and the other a field test of 
223 user accounts. We provide empirical evidence that 
popular points (hot-spots) do exist for many images, and 
explore two different types of attack to exploit this hot- 
spotting: (1) a “human-seeded” attack based on harvest- 
ing click-points from a small set of users, and (2) an en- 
tirely automated attack based on image processing tech- 
niques. Our most effective attacks are generated by har- 
vesting password data from a small set of users to attack 
other targets. These attacks can guess 36% of user passwords
within 2^31 guesses (or 12% within 2^16 guesses)
in one instance, and 20% within 2^33 guesses (or 10%
within 2^18 guesses) in a second instance. We perform
an image-processing attack by implementing and adapting
a bottom-up model of visual attention, resulting in a
purely automated tool that can guess up to 30% of user
passwords in 2^35 guesses for some instances, but under
3% on others. Our results suggest that these graphical 
password schemes appear to be at least as susceptible to 
offline attack as the traditional text passwords they were 
proposed to replace. 


1 Introduction 


The bane of password authentication using text-based 
passwords is that users choose passwords which are easy 
to remember, which generally translates into passwords 
that are easily guessed. Thus even when the size of 
a password space may be theoretically “large enough”
(in terms of number of possible passwords), the effective
password space from which many users actually
choose passwords is far smaller. Predictable patterns, 
largely due to usability and memory issues, thus allow 
successful search by variations of exhaustive guessing 
attacks. Forcing users to use “random” or other non- 
meaningful passwords results in usability problems. As 
an alternative, graphical password schemes require that 
a user remembers an image (or parts thereof) in place of 
a word. They have been largely motivated by the well- 
documented human ability to remember pictures better 
than words [25], and implied promises that the password 
spaces of various image-based schemes are not only suf- 
ficiently large to resist guessing attacks, but that the ef- 
fective password spaces are also sufficiently large. The 
latter, however, is not well established. 


Among the graphical password schemes proposed to 
date, one that has received considerable attention in the 
research literature is PassPoints [45, 46, 47]. It and other 
click-based graphical password schemes [18, 4, 31, 37] 
require a user to log in by clicking a sequence of points 
on a single background image. Usability studies have 
been performed to determine the optimal amount of er- 
ror tolerance [46], login and creation times, error rates, 
and general perception [45, 47]. An important remain- 
ing question for such schemes is: how secure are they? 
This issue remains largely unaddressed, despite specu- 
lation that the security of these schemes likely suffers 
from hot-spots — areas of an image that are more prob- 
able than others for users to click. Indeed, the impact 
of hot-spots has been downplayed (e.g., see [45, Section 
7]). In this paper, we focus on a security analysis of an 
implementation with the same parameters as used in a re- 
cent PassPoints publication [47]. A usability analysis of 
this implementation is presented in a separate paper [6]. 

We confirm the existence of hot-spots through empirical
studies, and show that some images are more susceptible
to hot-spotting than others. We also explore the security
impact of hot-spots, including a number of strategies
for exploiting them under an offline model similar
to that used by Ballard et al. [1]. Our work involves two
user studies. The first (lab) study used 17 diverse im- 
ages (four used in previous studies [46], and 13 of our 
own chosen to represent a range of detail). We collected 
graphical passwords for 32-40 users per image in a lab 
setting, and found hot-spots on all images even from this 
relatively small sample size; some images had signifi- 
cantly more hot-spots than others. In the second (field) 
study involving 223 user accounts over a minimum of 
seven weeks, we explore two of these images in greater 
depth. We analyzed our lab study data using formal mea- 
sures of security to make an informed decision of which 
two images to use in the field study. Our goal was to give 
PassPoints the best chance we could (in terms of antici- 
pated security), by using one highly ranked image, and 
another mid-ranked image also used in previous Pass- 
Points studies. 


We implement and evaluate two types of attack: 
human-seeded and purely automated. Our human-seeded 
attack is based on harvesting password data from a small 
number of users to attack passwords from a larger set 
of users. We seed various dictionaries with the pass- 
words collected in our lab study, and apply them to guess 
the passwords from our long-term field study. Our re- 
sults demonstrate that this style of attack is quite effec- 
tive against this type of graphical password: it correctly 
guessed 36% of user passwords within 2^31 guesses (or
12% within 2^16 guesses) on one image, and 20% within
2^33 guesses (or 10% within 2^18 guesses) on a second image.
We implement and adapt a combination of image
processing methods in an attempt to predict user choice,
and employ them as tools to expedite guessing attacks on
the user study passwords. The attack works quite well on
some images, cracking up to 30% of passwords, but less
than 3% on others within 2^35 guesses. These results give
an early signal that image processing can be a relevant 
threat, particularly as better methods emerge. 


Our contributions include the first in-depth study of
hot-spots in click-based (or cued-recall) graphical password
schemes and their impact on security through two
separate user studies: one lab-controlled and the other 
a field test. We propose the modification and use of 
image processing methods to expedite guessing attacks, 
and evaluate our implementation against the images used 
in our studies. Our implementation is based on Itti et 
al.’s [17] model of bottom-up visual attention and cor- 
ner detection, which allowed successful guessing attacks 
on some images, even with relatively naive dictionary 
strategies. Our most interesting contribution is applying
a human-seeded attack strategy, by harvesting password
data in a lab setting from small sets of users, to attack
other field study passwords. Our human-seeded attack
strategy for cued-recall graphical passwords is similar
to Davis et al.’s attack [8] against recognition-based
graphical passwords; notable differences include a more
straightforward dictionary generation method, and that
our seed data is from a separate population and (short-term)
setting.


The remainder of this paper is organized as follows. 
Section 2 provides background and terminology. Section 
3 presents our lab-controlled user study, and an analysis 
of observed hot-spots and the distribution of user click- 
points. Section 4 presents results on the larger (field) 
user study, and of our password harvesting attacks. Sec- 
tion 5 explores our use of image processing methods to 
expedite guessing attacks on the 17 images from the first 
user study and the two from the second user study. Re- 
lated work is briefly discussed in Section 6. Section 7 
provides further discussion and concluding remarks. 


2 Background and Terminology 


Click-based graphical passwords require users to log in 
by clicking a sequence of points on a single background 
image. Many variations are possible (see Section 6), de- 
pending on what points a user is allowed to select. We 
study click-based graphical passwords by allowing clicks 
anywhere on the image (i.e., PassPoints-style). We be- 
lieve that most findings related to hot-spots in this style 
will apply to other variations using the same images, as 
the “interesting” clickable areas are still present. 


We use the following terminology. Assume a user 
chooses a given click-point c as part of their password. 
The tolerable error or tolerance t is the error allowed 
for a click-point entered on a subsequent login to be ac- 
cepted as c. This defines a tolerance region (T-region) 
centered on c, which for our implementation using t = 9 
pixels, is a 19 x 19 pixel square. A cluster is a set of 
one or more click-points that lie within a T-region. The 
number of click-points belonging to a cluster is its size. 
A hot-spot is indicated by a cluster that is large, relative 
to the number of users in a given sample. To aid visu- 
alization and indicate relative sizes for clusters of size at 
least two, on figures we sometimes represent the underly- 
ing cluster by a shaded circle or halo with halo diameter 
proportional to its size. An alphabet is a set of distinct 
T-regions; our implementation, using 451 x 331 pixel images,
results in an alphabet of m = 414 T-regions.
Passwords composed of 5 clicks on an alphabet of size
414 provide the system with only a 43-bit full theoretical
password space; we discuss the implications of this
in Section 7.
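
The terminology above can be made concrete with a short Python sketch. It is an illustration under assumptions: the function names are hypothetical, and dividing the image area by the T-region area is only one plausible way of arriving at m = 414, not necessarily how the figure was derived.

```python
import math

TOLERANCE = 9                  # t, in pixels
WIDTH, HEIGHT = 451, 331       # image size in pixels
CLICKS = 5                     # click-points per password

Point = tuple[int, int]


def in_t_region(center: Point, click: Point, t: int = TOLERANCE) -> bool:
    """True if click lies in the (2t+1) x (2t+1) square centered on center."""
    return abs(click[0] - center[0]) <= t and abs(click[1] - center[1]) <= t


def cluster_size(center: Point, clicks: list[Point]) -> int:
    """Number of click-points that fall within the T-region around center."""
    return sum(in_t_region(center, c) for c in clicks)


side = 2 * TOLERANCE + 1                            # 19 pixels
alphabet = round(WIDTH * HEIGHT / side ** 2)        # assumed derivation of m
password_bits = CLICKS * math.log2(alphabet)        # full theoretical space

sample = [(100, 100), (105, 94), (109, 91), (110, 100), (300, 200)]
print(f"T-region side = {side}, alphabet = {alphabet}, "
      f"space ~ {password_bits:.1f} bits, "
      f"cluster size at (100, 100) = {cluster_size((100, 100), sample)}")
```

In this hypothetical sample, three of the five click-points fall in the T-region centered on (100, 100); relative to so small a sample, that would indicate a hot-spot.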
3 Lab Study and Clustering Analysis


Here we report on the results of a university-approved 
43-user study of click-based graphical passwords in a 
controlled lab environment. Each user session was con- 
ducted individually and lasted about one hour. Partici- 
pants were all university students who were not studying 
(or experts in) computer security. Each user was asked 
to create a click-based graphical password on 17 differ- 
ent images (some of these are reproduced herein; others 
are available from the authors). Four of the images are 
from a previous click-based graphical password study by 
Wiedenbeck et al. [46]; the other 13 were selected to pro- 
vide a range of values based on two image processing 
measures that we expected to reflect the amount of detail: 
the number of segments found from image segmentation 
[11] and the number of corners found from corner de- 
tection [16]. Seven of the 13 images were chosen to be 
those we “intuitively” believed would encourage fewer 
hot-spots; this is in addition to the four chosen in ear- 
lier research [46] using intuition (no further details were 
provided on their image selection methodology). 

EXPERIMENTAL DETAILS. We implemented a web- 
based experiment. Each user was provided a brief ex- 
planation of what click-based graphical passwords are, 
and given two images to practice creating and confirm- 
ing such passwords. To keep the parameters as consis- 
tent as possible with previous usability experiments of 
such passwords [47], we used d = 5 click-points for 
each password, an image size of 451 x 331 pixels, and a 
19 x 19 pixel square of error tolerance. Wiedenbeck et al. 
[47] used a tolerance of 20 x 20, allowing 10 pixels of tol- 
erated error on one side and 9 on the other. To keep the 
error tolerance consistent on all sides, we approximate 
this error tolerance using 19 x 19. Users were instructed 
to choose a password by clicking on 5 points, with no 
two the same. Although the software did not enforce this 
condition, subsequent analysis showed that the effect on 
the resulting cluster sizes was negligible for all images 
except pcb; for more details, see caption of Figure 1. We 
did not assume a specific encoding scheme (e.g., robust 
discretization [3] or other grid-based methods); the con- 
cept of hot-spots and user choice of click-points is gen- 
eral enough to apply across all encoding schemes. To 
allow for detailed analysis, we store and compare the ac- 
tual click-points. 

Once the user had a chance to practice a few pass- 
words, the main part of the experiment began. For each 
image, the user was asked to create a click-based graph- 
ical password that they could remember but that others 
will not be able to guess, and to pretend that it is pro- 
tecting their bank information. After initial creation, the 
user was asked to confirm their password to ensure they 
could repeat their click-points. On successful confirmation, 
the user was given 3D mental rotation tasks [33] 
as a distractor for at least 30 seconds. This distractor 
was presented to remove the password from their visual 
working memory, and thus simulate the effect of the pas- 
sage of time. After this period of memory tasks, the user 
was provided the image again and asked to log in using 
their previously selected password. If the user could not 
confirm after two failed attempts or log in after one failed 
attempt, they were permitted to reset their password for 
that image and try again. If the user did not like the im- 
age and felt they could not create and remember a pass- 
word on it, they were permitted to skip the image. Only 
two images had a significant number of skips: paperclips 
and bee. This suggests some passwords for these images 
were not repeatable, and we suspect our results for these 
images would show lower relative security in practice. 

To avoid any dependence on the order of images pre- 
sented, each user was presented a random (but unique) 
shuffled ordering of the 17 images used. Since most 
users did not make it through all 17 images, the number 
of graphical passwords created per image ranged from 
32 to 40, for the 43 users. Two users had a “jumpy” 
mouse, but we do not expect this to affect our present 
focus: the location of selected click-points. This short-term 
study was intended to collect data on initial user 
choice; although the mental rotation tasks work to remove 
the password from working memory, they do not 
account for any effect caused by password resets over 
time due to forgotten passwords. The long-term study 
(Section 4) does account for this effect, and we compare 
the results. 


3.1 Results on Hot-Spots and Popular 
Clusters Observed 


To explore the occurrence of hot-spotting in our lab user 
study, we assigned all the user click-points observed in 
the study to clusters as follows. Let $R$ be the raw (unprocessed) 
set of click-points, $M$ a list of temporary clusters, 
and $V$ the final resulting set of clusters. 


1. For each $c_k \in R$, let $B_k$ be a temporary cluster containing 
click-point $c_k$. Temporarily assign all user 
click-points in $R$ within $c_k$'s T-region to $B_k$. Add 
$B_k$ to $M$. 

2. Sort all clusters in $M$ by size, in decreasing order. 

3. Greedily make permanent assignments of click-points 
to clusters as follows. Let $B_\ell$ be the largest 
cluster in $M$. Permanently assign each click-point 
$c_k \in B_\ell$ to $B_\ell$, then delete each $c_k \in B_\ell$ from all 
other clusters in $M$. Delete $B_\ell$ from $M$, and add $B_\ell$ 
to $V$. Repeat until $M$ is empty. 
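
The three steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code: it assumes a T-region is a 19 x 19 square centred on a click-point (so two points share a region when both coordinates differ by at most 9 pixels), and it breaks ties between equally large clusters by taking the earliest click-point. The names `in_t_region` and `greedy_clusters` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[int, int]


def in_t_region(center: Point, other: Point, half_width: int = 9) -> bool:
    """True if `other` lies in the square T-region centred on `center` (assumed 19 x 19)."""
    return (abs(center[0] - other[0]) <= half_width
            and abs(center[1] - other[1]) <= half_width)


def greedy_clusters(points: List[Point], half_width: int = 9) -> List[List[Point]]:
    """Greedily assign click-points to clusters, largest temporary cluster first."""
    # Step 1: one temporary cluster per click-point, stored as sets of point indices.
    temp: Dict[int, Set[int]] = {
        k: {i for i, q in enumerate(points) if in_t_region(c, q, half_width)}
        for k, c in enumerate(points)
    }
    final: List[List[Point]] = []
    # Steps 2-3: repeatedly take the largest remaining cluster and make it permanent.
    while temp:
        largest = max(temp, key=lambda k: (len(temp[k]), -k))
        members = temp.pop(largest)
        final.append([points[i] for i in sorted(members)])
        # Remove the assigned points from every other temporary cluster,
        # and drop clusters that have become empty.
        for k in list(temp):
            temp[k] -= members
            if not temp[k]:
                del temp[k]
    return final


clicks = [(10, 10), (12, 11), (11, 9), (100, 100), (105, 103), (300, 300)]
for cluster in greedy_clusters(clicks):
    print(len(cluster), cluster)
```

Storing indices rather than coordinates keeps duplicate click-points from different users distinct, which matters because cluster size counts clicks rather than unique pixels.
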


This process determines a set $V$ of (non-empty) clusters 
and their sizes. We then calculate the observed 
“probability” $p_j$ (based on our user data set) of the cluster 
$j$ being clicked, as cluster size divided by total clicks 
observed. When the probability $p_j$ of a certain cluster 
is sufficiently high, we can place a confidence interval 
around it for future populations (of users who are similar 
in background to those in our study) using (1) as discussed 
below. 

Figure 1: The five most popular clusters (in terms of size, 
i.e., # of times selected), and # of popular clusters (size $\geq$ 5). 
Results are from 32-40 users, depending on the image, for the 
final passwords created on each image. For pcb, which shows 
only 6 clusters of size $\geq$ 5, the sizes of clusters 2-5 become 5, 
5, 4, and 3 when counting at most one click from each user. 


Each probability $p_j$ estimates the probability of a 
cluster being clicked for a single click. For 5-click 
passwords, we approximate the probability that a user 
chooses cluster $j$ in a password by $P_j = 5 \times p_j$. Note that 
the probability for a cluster $j$ increases slightly as other 
clicks occur (due to the constraint of 5 distinct clusters 
in a password); we ignore this in our present estimate of 
$P_j$. 

Our results in Figure 1 indicate a significant number of 
hot-spots for our sample of the full population (32-40 
users per image). Previous “conservative” assumptions 
[47] were that half of the available alphabet of T-regions 
would be used in practice, or 207 in our case. If this 
were the case, and all T-regions in the alphabet were 
equi-probable, we would expect to see some clusters of 
size 2, but none of size 3 after 40 participants; we observed 
significantly more on all 17 images. Figure 1 
shows that some images were clearly worse than others. 
There were many clusters of size at least 5, and some 
as large as 16 (see tea image). If a cluster in our lab 
study received 5 or more clicks (in which case we call it 
a popular or high-probability cluster), then statistically, 


this allows determination of a confidence interval, using 
Equation (1), which provides the $100(1-\alpha)\%$ confidence 
interval for a population proportion [9, page 288]. 


$$p \pm z_{\alpha/2} \sqrt{\frac{p\,q}{n}} \qquad (1)$$


Here $n$ is the total number of clicks (i.e., five times the 
number of users), $p$ takes the role of $p_j$, $q = 1 - p$, 
and $z_{\alpha/2}$ is from a z-table. A confidence interval can be 
placed around $p_j$ (and thus $P_j$) using (1) when $np \geq 5$ 
and $nq \geq 5$. For clusters of size $k \geq 5$, $p = k/n$, so 
$np = k$ and $nq = n - k$. In our case, $n \geq 32 \cdot 5$ and 
$n - k \geq 5$, as statistically required to use (1). 
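
To make the use of (1) concrete, the following sketch computes the per-password probability and its 95% interval for one cluster. It assumes the standard two-sided value z = 1.96 for 95% confidence and scales the per-click interval by the 5 clicks in a password; the function name is illustrative.

```python
import math
from typing import Tuple


def cluster_confidence_interval(cluster_size: int, num_users: int,
                                clicks_per_password: int = 5,
                                z: float = 1.96) -> Tuple[float, float, float]:
    """Return (P_j, lower, upper) for a cluster, using the interval in Equation (1)."""
    n = num_users * clicks_per_password      # total clicks observed on the image
    if cluster_size < 5 or n - cluster_size < 5:
        raise ValueError("Equation (1) needs n*p >= 5 and n*q >= 5")
    p = cluster_size / n                     # per-click probability p_j
    q = 1.0 - p
    half_width = z * math.sqrt(p * q / n)
    scale = clicks_per_password              # P_j = 5 * p_j
    return scale * p, scale * (p - half_width), scale * (p + half_width)


# A cluster of size 13 observed among 35 users.
P, lower, upper = cluster_confidence_interval(13, 35)
print(f"P_j = {P:.3f}, 95% CI = ({lower:.3f}; {upper:.3f})")
```
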

Table 1 shows these confidence intervals for four im- 
ages, predicting that in future similar populations many 
of these points would be clicked by between 10-50% of 
users, and some points would be clicked by 20-60% of 
users with 95% confidence ($\alpha = .05$). For example, 
in Table 1(a), the first row shows the highest frequency 
cluster (of size 13); as our sample for this image was only 
35 users, we observed 37.1% of our participants choos- 
ing this cluster. Using (1), between 17.7% and 56.6% of 
users from future populations are expected to choose this 
same cluster (with 95% confidence). 

Figure 1 and Table 1 show the popularity of the hottest 
clusters; Figure 1’s line also shows the number of pop- 
ular clusters. The clustering effect evident in Figures 1, 
2, and Table 1 clearly establishes that hot-spots are very 
prominent on a wide range of images. We further pursue 
how these hot-spots impact the practical security of full 
5-click passwords in Section 4.2. As a partial summary, 
our results suggest that many images have significantly 
more hot-spots than would be expected if all T-regions 
were equi-probable. The paperclips, cars, faces, and tea 
images are not as susceptible to hot-spotting as others 
(e.g., mural, truck, and philadelphia). For example, the 
cars image had only 4 clusters of size at least 5, and only 
one with frequency at least 10. The mural image had 15 
clusters of size at least 5, and 3 of the top 5 frequency 
clusters had frequency at least 10. Given our sample size 
for the mural image was only 36 users, these clusters are 
quite popular. This demonstrates the range of effect the 
background image can have (for the images studied). 

Although previous work [46] suggests using intuition 
for choosing more secure background images (no further 
detail was provided), our results apparently show that in- 
tuition is not a good indicator. Of the four images used 
in other click-based graphical passwords studies, three 
showed a large degree of clustering (pool, mural, and 
Philadelphia). Furthermore, two other images that we 
“intuitively” believed would be more secure background 
images were among the worst (truck and citymap-nl). 
The truck image had 10 clusters of size at least 5, and the 
top 5 clusters had frequency at least 13. Finding reliable 
automated predictors of more secure background images 
remains an open problem. Our preliminary work with 
simple measures (image segmentation, corner detection, 
and image contrast measurement) does not appear to offer 
reliable indicators. Thus, we next explore the impact 
of hot-spotting across images to help choose two images 
for further analysis. 

(b) mural (originally from [46]; see Appendix A). 
(c) philadelphia (originally from [46]; see Figure 5). 
(d) truck (originally from [12]). 

Figure 2: Observed click-points. Halo diameters are 10 times the size of the underlying cluster, illustrating its popularity. 

(a) pool image 
Cluster size | $P_j$ | 95% CI ($P_j$) 
13 | 0.371 | (0.177; 0.566) 
12 | 0.343 | (0.156; 0.530) 
12 | 0.343 | (0.156; 0.530) 
11 | 0.314 | (0.134; 0.494) 
11 | 0.314 | (0.134; 0.494) 

(b) mural image 
Cluster size | $P_j$ | 95% CI ($P_j$) 
14 | 0.400 | (0.199; 0.601) 
13 | 0.371 | (0.177; 0.566) 
10 | 0.286 | (0.114; 0.458) 
8 | 0.229 | (0.074; 0.383) 
7 | 0.200 | (0.055; 0.345) 

(c) philadelphia image 
Cluster size | $P_j$ | 95% CI ($P_j$) 
10 | 0.286 | (0.114; 0.458) 
10 | 0.286 | (0.114; 0.458) 
9 | 0.257 | (0.094; 0.421) 
9 | 0.257 | (0.094; 0.421) 
7 | 0.200 | (0.055; 0.345) 

(d) truck image 
Cluster size | $P_j$ | 95% CI ($P_j$) 
15 | 0.429 | (0.221; 0.636) 
14 | 0.400 | (0.199; 0.601) 
13 | 0.371 | (0.177; 0.566) 
13 | 0.371 | (0.177; 0.566) 
13 | 0.371 | (0.177; 0.566) 

Table 1: 95% confidence intervals for the top 5 clusters found in each of four images. The confidence intervals are for the 
percentage of users expected to choose this cluster in future populations. 


3.2 Measurement and Comparison of Hot- 
Spotting for Different Images 


To compare the relative impact of hot-spotting on each 
image studied, we calculated two formal measures of 
password security for each image: entropy $H(X)$, in 
equation (2), and in equation (3), the expected number 
of guesses $E(f(X))$ to correctly guess a password assuming 
the attacker knows the probabilities $w_i > 0$ for 
each password $i$. The relationship between $H(X)$ and 
$E(f(X))$ for password guessing is discussed by Massey 
[26]. Of course in general, the $w_i$ are unknown, and our 
study gives only very coarse estimates; nonetheless, we 
find it helpful to use this to develop an estimate of which 
images will have the least impact from hot-spotting. For 
(2) and (3), $n$ is the number of passwords (of probability 
$> 0$), random variable $X$ ranges over the passwords, and 
$w_i = \mathrm{Prob}(X = x_i)$ is calculated as described below. 


$$H(X) = -\sum_{i=1}^{n} w_i \cdot \log(w_i) \qquad (2)$$

$$E(f(X)) = \sum_{i=1}^{n} i \cdot w_i, \quad \text{where } w_i \geq w_{i+1}, \qquad (3)$$

and $f(X)$ is the number of guesses before success. We calculate 
these measures based on our observed user data. For 
this purpose, we assume that users will choose from a set 
of click-points (following the associated probabilities), 
and combine 5 of them randomly. This assumption almost 
certainly over-estimates both $E(f(X))$ and $H(X)$ 
relative to actual practice, as it does not consider click-order 
patterns or dependencies. Thus, popular clusters 
likely reduce security by more than we estimate here. 
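
The two measures in (2) and (3) can be computed from any list of password probabilities, as in the short sketch below. It assumes base-2 logarithms so that entropy is reported in bits (the figures in this section are given in bits), and it sorts the probabilities in decreasing order to model an attacker who guesses the most likely password first. The function names are illustrative.

```python
import math
from typing import Iterable


def entropy_bits(probabilities: Iterable[float]) -> float:
    """Entropy H(X) of a password distribution, in bits (base-2 log assumed)."""
    return -sum(w * math.log2(w) for w in probabilities if w > 0)


def expected_guesses(probabilities: Iterable[float]) -> float:
    """Expected number of guesses E(f(X)) when guessing in decreasing-probability order."""
    ordered = sorted(probabilities, reverse=True)
    return sum(i * w for i, w in enumerate(ordered, start=1))


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy_bits(uniform), expected_guesses(uniform))
print(entropy_bits(skewed), expected_guesses(skewed))
```

The skewed distribution has the same number of passwords as the uniform one but needs fewer guesses on average, which is the effect hot-spots have on the measures reported here.
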
We define $C^V$ to be the set of all 5-permutations 
derivable from the clusters observed in our user study 
(as computed in Section 3.1). Using the probabilities 
$p_j$ of each cluster, the probabilities $w_i$ of each password 
in $C^V$ are computed as follows. Pick a combination 
of 5 observed clusters $j_1, \ldots, j_5$ with respective 
probabilities $p_{j_1}, \ldots, p_{j_5}$. For each permutation of 
these clusters, calculate the probability of that permutation 
occurring as a password. Due to our instructions 
that no two click-points in a password can fall in the 
same T-region, these probabilities change as each point 
is clicked. Thus, for password $i = (j_1, j_2, j_3, j_4, j_5)$, 
$w_i = p_{j_1} \cdot [p_{j_2}/(1-p_{j_1})] \cdot [p_{j_3}/((1-p_{j_1}) \cdot (1-p_{j_2}))] \cdots$. 
The resulting set $C^V$ is a set of click-based graphical 
passwords (with associated probabilities) that coarsely 
approximates the effective password space if the clusters 
observed in our user study are representative of those in 
larger similar populations. We can order the elements of 
$C^V$ using the probabilities $w_i$ based on our user study. 
An ordered $C^V$ could be used as the basis of an attack 
dictionary; this ordering could be much improved, for 
example, by exploiting expected patterns in click-order. 
See Section 4.2 for more details. 
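
The probability of a single ordered password can be sketched as below. This follows one reading of the formula as printed in this section, in which the factor for each later click is divided by the product of (1 - p) over the clusters already chosen; the function name and the sample probabilities are illustrative only.

```python
from typing import Sequence


def password_probability(cluster_probs: Sequence[float]) -> float:
    """Probability w_i of an ordered password, given its clusters' probabilities in click order."""
    w = 1.0
    denominator = 1.0  # product of (1 - p) over clusters already clicked
    for p in cluster_probs:
        w *= p / denominator
        denominator *= (1.0 - p)
    return w


# Hypothetical per-click probabilities for five clusters, in click order.
print(password_probability([0.07, 0.06, 0.05, 0.04, 0.03]))
```

Because the denominator depends on the order of the clicks, different permutations of the same five clusters can receive different probabilities, which is why each permutation is evaluated separately.
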

For comparison to previous “conservative” estimates 
that simply half of the available click-points (our T-regions) 
would be used in practice [47], we calculate $C^U$. 
We compare to $C^U$ as it is a baseline that approximates 
what we would expect to see after running 32 users (the 
lowest number of users we have for any image), if previous 
estimates were accurate, and T-regions were equi-probable. 
$C^U$ is the set of all permutations of clusters 
we expect to find after observing 32 users, assuming a 
uniformly random alphabet of size 207. 

Fig. 3 depicts the entropy and expected number of 
guesses for $C^V$. Notice the range between images, and 
the drop in $E(f(X))$ from $C^U$ to values of $C^V$. Comparison 
to the marked $C^U$ values for (1) $H(X)$ and (2) 
$E(f(X))$ indicates that previous rough estimates 
overestimate practical security for all images, 
some much more so than others. This is at least partially 
due to click-points not being equi-probable in practice 
(as illustrated by hot-spots), and apparently also due to 
the previously suggested effective alphabet size (half of 
the full alphabet) being an overestimate. Indeed, a large 
alphabet is precisely the theoretical security advantage 
that these graphical passwords have over text passwords. 
If the effective alphabet size is not as large as previously 
expected, or is not well-distributed, then we should re- 
duce our expectations of the security. 

These results appear to provide fair approximation 
of the entropy and expected number of guesses for the 
larger set of users in the field study; we performed 
these same calculations again using the field study data. 
For both of the two images, the entropy measures were 
within one bit of values measured here (less than a bit 
higher for pool, and about one bit lower for cars). The 
number of expected guesses increased for both images 
(by 1.3 bits for cars, and 2.5 bits for pool). 

The variation across all images shows how much of an 
impact the background image can have, even when using 
images that are “intuitively” good. For example, the image 
that showed the most impact from hot-spotting was 
the mural image, chosen for an earlier PassPoints usability 
study [46]. We note that the paperclips image scores 
best in the charted security measures (its $H(X)$ measure 
is within a standard deviation of $C^U$); however, 8 
of 36 users who created a password on this image could 
not perform the subsequent login (and skipped it, as 
noted earlier), so the data for this image represents some 
passwords that are not repeatable, and thus we suspect it 
would have lower relative security in practice. 

Figure 3: Security measures for each image (in bits). $C^V$ 
is based on data from lab user study of 32-40 passwords (depending 
on image). For comparison to a uniform distribution, 
(1) marks $H(X)$ for $C^U$, and (2) marks $E(f(X))$ for $C^U$. 

Overall, one can conclude that image choice can have 
a significant impact on the resulting security, and that de- 
veloping reliable methods to filter out images that are the 
most susceptible to hot-spotting would be an interesting 
avenue for future research. 

We used these formal measures to make an informed 
decision on which images to use for our field study. Our 
goal was to give the PassPoints scheme the best chance 
(in terms of anticipated security) we could, by using one 
image (cars) that showed the least amount of clustering 
(with the best user success in creating a password), and 
also using another that ranked in the middle (pool). 


4 Field Study and Harvesting Attacks 


Here we describe a 7-week or longer (depending on the 
user), university-approved field study of 223 user ac- 
counts on two different background images. We col- 
lected click-based graphical password data to evaluate 
the security of this style of graphical passwords against 
various attacks. As discussed, we use the entropy and 
expected guesses measures from our lab study to choose 
two images that would apparently offer different levels 
of security (although both are highly detailed): pool and 
cars. The pool image had a medium amount of cluster- 
ing (cf. Fig. 3), while the cars image had nearly the least 
amount of clustering. Both images had a low number of 
skips in the lab study, indicating that they did not cause 
problems for users with password creation. 
EXPERIMENTAL DETAILS. We implemented a web-based 
version of PassPoints, used by three first-year undergraduate 
classes: two were first-year courses for computer 
science students, while the third was a first-year 
course for non-computer science students enrolled in a 
science degree. The students used the system for at least 
7 weeks to gain access to their course notes, tutorials, 
and assignment solutions. For comparison with previ- 
ous usability studies on the subject, and our lab study, 
we used an image size of 451 x 331 pixels. After the 
user entered their username and course, the screen dis- 
played their background image and a small black square 
above the image to indicate their tolerance square size. 
For about half of users (for each image), a 19 x 19 T-region 
was used, and for the other half, a 13 x 13 T-region. 
The system enforced that each password had to 
be 5 clicks and that no click-point could be within t = 9 
pixels of another (vertically and horizontally). To com- 
plete initial password creation, a user had to successfully 
confirm their password once. After initial creation, users 
were permitted to reset their password at any time using 
a previously set secret question and answer. 

Users were permitted to login from any machine 
(home, school, or other), and were provided an online 
FAQ and help. The users were asked that they keep 
in mind that their click-points are a password, and that 
while they will need to pick points they can remember, 
not to pick points that someone else will be able to guess. 
Each class was also provided a brief overview of the 
system, explaining that their click-points in subsequent 
logins must be within the tolerance shown by a small 
square above the background image, and that the input 
order matters. We only use the final passwords created 
by each user that were demonstrated as successfully re- 
called at least one subsequent time (i.e., at least once af- 
ter the initial create and confirm). We also only use data 
from 223 out of 378 accounts that we would consider, as 
this was the number that provided the required consent. 
These 223 user accounts map to 189 distinct users as 34 
users in our study belonged to two classes; all but one 
of these users were assigned a different image for each 
account, and both accounts for a given user were set to 
have the same error tolerance. Of the 223 user accounts, 
114 used pool and 109 used cars as a background image. 


4.1 Field Study Hot Spots and Relation to 
Lab Study Results 


Here we present the clustering results from the field 
study, and compare results to those on the same two im- 
ages from the lab study. Fig. 4b shows that the areas 
that were emerging as hot-spots from the lab study (re- 
call Fig. 2a) were also popular in the field study, but other 
clusters also began to emerge. Fig. 4a shows that even 
our “best” image from the lab study (in terms of apparent 
resistance to clustering) also exhibits a clustering effect 
after gathering 109 passwords. Table 2 provides a closer 
examination of the clustering effect observed. 

















Image name | Size of most popular clusters | # clusters 
           | #1 | #2 | #3 | #4 | #5       | of size $\geq$ 5 
cars       | 26 | 25 | 24 | 22 | 22       | 32 
pool       | 35 | 30 | 30 | 27 | 27       | 28 





























Table 2: Most popular clusters (field study). 


These values show that on pool, there were 5 points 
that 24-31% of users chose as part of their password. On 
cars, there were 5 points that 20-24% of users chose as 
part of their password. The clustering on the cars im- 
age indicates that even highly detailed images with many 
possible choices have hot spots. Indeed, we were sur- 
prised to see a set of points that were this popular, given 
the small amount of observed clustering on this image 
from our smaller lab study. 

The prediction intervals calculated from our lab study 
(recall Section 3) provide reasonable predictions of what 
we observed in the field study. For cars, the prediction 
intervals for 3 out of the 4 popular clusters were correct. 
For pool, the prediction intervals for 8 out of the 9 popu- 
lar clusters were correct. The anomalous cluster on cars 
was still quite popular (chosen by 12% of users), but the 
lower end of the lab study’s prediction interval for this 
cluster was 20%. The anomalous cluster on pool was 
also still quite popular (chosen by 18% of users), but the 
lower end of the lab study’s prediction interval for this 
cluster was 19%. 

These clustering results (and their close relationship 
to the lab study’s results) indicate that the points chosen 
from the lab study should provide a reasonably close ap- 
proximation of those chosen in the field. This motivates 
our attacks based on the click-points harvested from the 
lab study. 


4.2 Harvesting Attacks: Method & Results 


We hypothesized that due to the clustering effect we ob- 
served in the lab study, human-seeded attacks based on 
data harvested from other users might prove a successful 
attack strategy against click-based graphical passwords. 
Here we describe our method of creating these attacks, 
and our results are presented below. 

Table 3 provides the results of applying various at- 
tack dictionaries based on our harvested data, and their 
success rates when applied to our field study’s password 
database.* 

$C^R_u$ is a dictionary composed of all 5-permutations 
of click-points collected from $u$ users. Note that the $C^R_u$ bitsize 
is a slight overestimate, as there are some combinations 
of points that would not constitute a valid password, 
due to two or more points being within $t = 9$ pixels 
of each other. If this were taken into account, our 
attacks would be slightly better. In our lab study, $u = 33$ 
for cars, and $u = 35$ for pool. Thus, the size of $C^R_u$ 
for cars is $P(165, 5) = 2^{36.7}$ entries, and for pool is 
$P(175, 5) = 2^{37.1}$ entries. $C^V_u$ is a dictionary composed 
of all 5-permutations of the clusters calculated (using the 
method described in Section 3.1) from the click-points 
from $u$ users. Thus, the alphabet size (and overall size) 
for $C^V_u$ is smaller under the same number of users than in 
a corresponding $C^R_u$ dictionary. Note that all of these dictionary 
sets can be computed on-the-fly from base data as 
necessary, and thus need not be stored. 
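
The dictionary sizes above follow from counting 5-permutations of the alphabet, which can be checked with the short sketch below. It relies on `math.perm` from the Python standard library (Python 3.8 or later); the function name is illustrative.

```python
import math


def dictionary_bits(alphabet_size: int, clicks: int = 5) -> float:
    """Size in bits of a dictionary of all `clicks`-permutations of the alphabet."""
    return math.log2(math.perm(alphabet_size, clicks))


# Alphabets of 5 raw click-points per user for 33 and 35 users.
for users in (33, 35):
    m = 5 * users
    print(f"u = {users}, m = {m}, size = {dictionary_bits(m):.2f} bits")
```
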

Table 3 illustrates the efficacy of seeding a dictionary 
with a small number of users' click-points. The most 
striking result shown is that initial password choices harvested 
from 15 users, in a setting where long-term recall 
is not required, can be used to generate (on average) 
27% of user passwords for pool (see $C^R_{15}$). As we expected, 
cars was not as easily attacked as pool; more user 
passwords are required to seed a dictionary that achieves 
similar success rates (see the $C^R$ rows for cars). 

We also tried these attacks using a small set of field 
study user passwords to seed an attack against the re- 
maining field study user passwords. The result, in Ta- 
ble 4, shows a difference between the lab study and the 
field study (final) passwords; however, there remains suf- 
ficient similarity between the two groups to launch ef- 
fective attacks using the lab-harvested data. One pos- 
sible reason for the differences in user choice between 
the two studies is that the field study users may not have 
been as motivated as the lab study users to create “dif- 
ficult to guess” graphical passwords. It is unclear how 
a user might measure whether they are creating a graph- 
ical password that is difficult to guess, and whether in 
trying, if users would actually change their password’s 
strength; one study [36] shows that only 40% of users 
actually change the complexity of their text passwords 
according to the security of the site. Another equally 
possible explanation might be that the lab study users 
chose more difficult passwords than they would have in 
practice, as they were aware there was no requirement 
for long term recall, and also did not have a chance to 
forget and subsequently reset their passwords to some- 
thing more memorable. With our current data, it is not 
clear whether we can conclusively determine a reason 
for these differences. 

Next we examined the effect of click-order patterns 
as one method to capture a user’s association between 
points, and reduce our dictionary sizes. For each image, 
we select one dictionary to optimize with click-order patterns. 
This dictionary is one of the ten randomly selected 
$C^V_{15}$ subsets that were averaged (results of this average 
are in Table 3). We selected the dictionary whose guessing 
success was closest to the average reported in Table 
3. The success rate that these dictionaries achieve (before 
applying click-order patterns) is provided in the first 
row of Table 5. 

(a) cars (originally from [5]). 
(b) pool (originally from [46, 47]). 

Figure 4: Observed clustering (field study). Halo diameter is 5x the number of underlying clicks. 

cars (u = 33); # passwords guessed out of 109 
Set | m | bitsize | avg | min | max 
$C^R_u$ | 165 | 36.7 | 37 (34%) | † | † 
$C^V_u$ | 104 | 33.4 | 22 (20%) | † | † 
$C^R_{25}$ | 125 | 34.7 | 24 (22%) | 9 (8%) | 35 (32%) 
$C^V_{25}$ | 85 | 31.9 | 21 (19%) | 7 (6%) | 27 (25%) 
$C^R_{20}$ | 100 | 33.1 | 22 (20%) | 8 (7%) | 32 (29%) 
$C^V_{20}$ | 72 | 30.6 | 17 (16%) | 8 (7%) | 30 (28%) 
$C^R_{15}$ | 75 | 30.9 | 14 (13%) | 4 (4%) | 25 (23%) 
$C^V_{15}$ | 56 | 28.8 | 12 (11%) | 4 (4%) | 24 (22%) 

pool (u = 35); # passwords guessed out of 114 
Set | m | bitsize | avg | min | max 
$C^R_u$ | 175 | 37.1 | 59 (52%) | † | † 
$C^V_u$ | [illegible] | 31.1 | 41 (36%) | † | † 
$C^R_{25}$ | 125 | 34.7 | 42 (37%) | 29 (25%) | 56 (49%) 
$C^V_{25}$ | 59 | 29.2 | 34 (29%) | 19 (17%) | 47 (41%) 
$C^R_{20}$ | 100 | 33.1 | 35 (31%) | 24 (21%) | 55 (48%) 
$C^V_{20}$ | 52 | 28.2 | 28 (25%) | 18 (16%) | 43 (38%) 
$C^R_{15}$ | 75 | 30.9 | 30 (27%) | 20 (18%) | 45 (39%) 
$C^V_{15}$ | 41 | 26.4 | 26 (23%) | 14 (12%) | 43 (38%) 

Table 3: Dictionary attacks using different sets. All subsets of users (after the first two rows) are the result of 10 randomly selected 
subsets of $u$ short-term study user passwords. For rows 1 and 2, note that u = 33 and 35. m is the alphabet size, which defines 
the dictionary bitsize. See text for descriptions of $C^V$ and $C^R$. † The first two rows use all data from the short-term study to seed a 
single dictionary, and as such, there are no average, max, or min values to report. 


We hypothesized that many users will choose passwords 
in one (or a combination) of six simple click-order 
patterns: right to left (RL), left to right (LR), top to 
bottom (TB), bottom to top (BT), clockwise (CW), and 
counter-clockwise (CCW). Diagonal (DIAG) is a combination 
of a consistent vertical and horizontal direction 
(e.g., both LR and TB). Note that straight lines also fall 
into this category; for example, when $(x_i, y_i)$ is a horizontal 
and vertical pixel coordinate, the rule for LR is 
$(x_1 \leq x_2 \leq x_3 \leq x_4 \leq x_5)$, so a vertical line of 
points would satisfy this constraint. We apply our base 
attack dictionaries (one for each image), under various 
sets of these click-order pattern constraints to determine 
their success rates and dictionary sizes. This method 
only initiates the exploration of other ways that click-based 
graphical passwords could be analyzed for patterns 
in user choice. We expect this general direction will yield 
other results, including patterns due to mnemonic strategies 
(e.g., clicking all red objects). 
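
The directional patterns can be tested with simple monotonicity checks on the coordinates, as in the sketch below. It assumes non-strict inequalities (so straight lines qualify, as the text notes) and pixel coordinates whose y value grows downward, so that top to bottom means non-decreasing y. CW and CCW are omitted because this section does not give a precise rule for them; the function name is illustrative.

```python
from typing import List, Sequence, Set, Tuple

Point = Tuple[int, int]


def _non_decreasing(values: List[int]) -> bool:
    return all(a <= b for a, b in zip(values, values[1:]))


def _non_increasing(values: List[int]) -> bool:
    return all(a >= b for a, b in zip(values, values[1:]))


def click_order_patterns(points: Sequence[Point]) -> Set[str]:
    """Return which of LR, RL, TB, BT, DIAG an ordered list of click-points satisfies."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    patterns: Set[str] = set()
    if _non_decreasing(xs):
        patterns.add("LR")
    if _non_increasing(xs):
        patterns.add("RL")
    if _non_decreasing(ys):  # assumes y grows downward in pixel coordinates
        patterns.add("TB")
    if _non_increasing(ys):
        patterns.add("BT")
    # DIAG: a consistent horizontal direction combined with a consistent vertical one.
    if patterns & {"LR", "RL"} and patterns & {"TB", "BT"}:
        patterns.add("DIAG")
    return patterns


print(sorted(click_order_patterns([(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)])))
print(sorted(click_order_patterns([(1, 5), (3, 1), (2, 4), (5, 2), (4, 3)])))
```

Filtering a dictionary with such a predicate keeps only the permutations that match a pattern, which is how a constraint shrinks the dictionary without needing new harvested data.
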


The results shown in Table 5 indicate that, on aver- 
age for the pool image, using only the diagonal con- 
straint will reduce the dictionary size to 16 bits, while 
still cracking 12% of passwords. Similarly, for the cars 
image, using only this constraint will reduce the dictio- 
nary to 18 bits, while still cracking 10% of passwords. 
The success rate of our human-seeded attack is compa- 
rable to recent results on cracking text-based passwords 
[23], where 6% of passwords were cracked with a 1.2 
million entry dictionary (almost 2 bits larger than our 
DIAG dictionary based on harvested points of 15 users 
for cars, and 4 bits larger for DIAG based on 15 users 
for pool). Furthermore, unlike most text dictionaries, we 
do not need to store the entire dictionary as it is generated 
on-the-fly from the alphabet. At best, this indicates that 
these graphical passwords are slightly less secure than 
the text-based passwords they have been proposed to replace. 
However, the reality is likely worse. The analogy 
to our attack is collecting text passwords from 15 
users, and generating a dictionary based on all permutations 
of the characters harvested, and finding it generated 
a successful attack. The reason most text password dictionaries 
succeed is due to known dependent patterns in 
language (e.g., using di- or tri-grams in a Markov model 
[29]). The obvious analogy to this method has not 
yet been attempted, but would be another method of further 
reducing the dictionary size. 

Dictionary | cars: m | bitsize | # passwords guessed | pool: m | bitsize | # passwords guessed 
$C^R_{20}$ | 100 | 33.1 | 29/89 (33%) | 100 | 33.1 | 52/94 (55%) 
$C^R_{10}$ | 50 | 27.9 | 23/99 (23%) | 50 | 27.9 | 22/104 (21%) 

Table 4: Dictionary attack results, using the first 20 and 10 users from the long term study to seed an attack against the others. m 
is the alphabet size. See text for descriptions of $C^V$ and $C^R$. 

Click-order pattern | cars: # passwords guessed of 109 | cars: dictionary size (bits) | pool: # passwords guessed of 114 | pool: dictionary size (bits) 
$C^V_{15}$ (with no pattern) | 13 (12%) | 29.2 | 22 (19%) | 27.1 
LR, RL, CW, CCW, TB, BT | 12 (11%) | 25.6 | 22 (19%) | 23.4 
LR, RL | 11 (10%) | 23.8 | 19 (17%) | 22.0 
TB, BT | 12 (11%) | 24.4 | 15 (13%) | 21.9 
CW, CCW | 0 (0%) | 24.0 | 4 (4%) | 21.7 
DIAG | 11 (10%) | 18.4 | 14 (12%) | 16.2 

Table 5: Effect of incorporating click-order patterns on dictionary size and success, as applied to a representative dictionary 
of clusters gathered from 15 users. Results indicate that the DIAG pattern produces the smallest dictionary, and still guesses a 
relatively large number of passwords. 


5 Purely Automated Attacks Using Image 
Processing Tools 


Here we investigate the feasibility of creating an at- 
tack dictionary for click-based graphical passwords by 
purely automated means. Pure automation would side- 
step the need for human-seeding (in the form of harvest- 
ing points), and thus should be easier for an attacker to 
launch than the attacks presented in Section 4. We create 
this attack dictionary by modelling user choice using a 
set of image processing methods and tools. The idea is 
that these methods may help predict hot-spots by auto- 
mated means, leading to more efficient search orderings 
for exhaustive attacks. This could be used for modeling
attackers constructing attack dictionaries, and proactive
password checking.


5.1 Identifying Candidate Click-Points 


We begin by identifying details of the user task in cre- 
ating a click-based graphical password. The user must 
choose a set of points (in a specific order) that can be re- 
membered in the future. We do not focus on mnemonic 
strategies for these automated dictionaries (although they 
could likely be improved using the click-order patterns 
from Section 4.2), but rather the basic features of a point 
that define candidate click-points. To this end, we iden- 
tify a candidate click-point to be a point which is: (1) 
identifiable with precision within the system’s error tol- 
erance; and (2) distinguishable from its surroundings, 
i.e., easily picked out from the background. Regarding 
(1), as an example, the pool image has a red garbage can 
that is larger than the 19 x 19 error tolerance; to choose 
the red garbage can, a user must pick a specific part of it 
that can be navigated to again (on a later occasion) with 
precision, such as the left handle. Regarding (2), as an 
example, it is much easier to find a white logo on a black 
hat than a brown logo on a green camouflage hat. 

For modelling purposes, we hypothesize that the fewer 
candidate click-points (as defined above) that an image 
has, the easier it is to attack. We estimate candidate click- 
points by implementing a variation of Itti et al.’s bottom- 
up model of visual attention (VA) [17], and combining it 
with Harris corner detection [16]. 

Corner detection picks out the areas of an image that
have variations of intensity in horizontal and vertical di- 
rections; thus we expect it should provide a reasonable 
measure of whether a point is identifiable. Itti et al.’s 
VA determines areas that stand out from their surround- 
ings, and thus we expect it should provide a reasonable 
measure of a point’s distinguishability. Briefly, VA cal- 
culates a saliency map of the image based on 3 channels 
(color, intensity, and orientation) over multiple scales. 
The saliency map is a grayscale image whose brighter 
areas (i.e., those with higher intensity values) represent 
more conspicuous locations. A viewer’s focus of atten- 
tion should theoretically move from the most conspicu- 
ous locations (represented by the highest intensity areas 
on the saliency map) to the least. We assume that users 
are more likely to choose click-points from areas which 
draw their visual attention. 

We implemented a variation of VA and combined it 
with Harris corner detection to obtain a prioritized list 
of candidate click-points (CCP-list) as follows. (1) Cal- 
culate a VA saliency map (see Fig. 5(b)) using slightly 
smaller scales than Itti et al. [17] (to reflect our interest 
in smaller image details). The higher-intensity pixel val- 
ues of the saliency map reflect the most “conspicuous” 
(and distinguishable) areas. (2) Calculate the corner
locations using the Harris corner detection function as
implemented by Kovesi [22] (see note 4 and Fig. 5(c)). (3) Use the
corner locations as a bitmask for the saliency map, pro- 
ducing what we call a cornered saliency map (CSM). (4) 
Compute an ordered CCP-list of the highest to lowest 
intensity-valued CSM points. Similar to the focus-of- 
attention inhibitors used by Itti et al., we inhibit a CSM 
point (and its surrounding tolerance) once it has been 
added to the CCP-list so it is not chosen again (see Fig. 
5(d)). The CCP-list is at least as long as the alphabet size 
(414), but is a prioritized list, ranking points from (the 
hypothesized) most to least likely. 
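
Steps (3) and (4) above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the study: it assumes the saliency map and the corner mask from steps (1) and (2) are already available as equally sized 2D lists, it inhibits a square window of the tolerance size around each chosen point, and the name build_ccp_list and its parameters are hypothetical.

```python
def build_ccp_list(saliency, corners, tolerance=19, max_points=None):
    """Return (row, col) points ordered from highest to lowest CSM intensity.

    saliency: 2D list of non-negative floats (the VA saliency map).
    corners:  2D list of booleans of the same shape (the corner bitmask).
    """
    rows, cols = len(saliency), len(saliency[0])
    # Step (3): keep saliency only at corner locations (cornered saliency map).
    csm = [[saliency[r][c] if corners[r][c] else 0.0 for c in range(cols)]
           for r in range(rows)]
    half = tolerance // 2
    ccp_list = []
    # Step (4): repeatedly take the brightest remaining CSM point.
    while max_points is None or len(ccp_list) < max_points:
        best, best_rc = 0.0, None
        for r in range(rows):
            for c in range(cols):
                if csm[r][c] > best:
                    best, best_rc = csm[r][c], (r, c)
        if best_rc is None:  # nothing with positive intensity is left
            break
        ccp_list.append(best_rc)
        # Inhibit the chosen point and its surrounding tolerance window.
        r0, c0 = best_rc
        for r in range(max(0, r0 - half), min(rows, r0 + half + 1)):
            for c in range(max(0, c0 - half), min(cols, c0 + half + 1)):
                csm[r][c] = 0.0
    return ccp_list
```

In this sketch, corner points whose saliency is zero are never listed, and ties are broken in row-major order; both are arbitrary choices made for brevity.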


5.2 Model Results 


We evaluated the performance of the CCP-list as a model 
of user choice using the data from both the lab and field 
user studies. We first examined how well the first half
(top 207) of the CCP-list overlaps with the observed
high-probability clusters from our lab user study (i.e.,
those clusters of size at least 5). We found that this
half-alphabet covered all high-probability clusters on the
icons, faces, and cars images, and most of the high-probability
clusters on 11 of the 17 images. On most of the images where
our model performed poorly (pcb, citymap-gr, paperclips,
smarties, and truck), the cause appeared to be the
saliency map algorithm being overloaded with too much
detail. On the other image where this approach did
not perform well (mural), the cause appears to be the corner
masking in step (3); the high-probability points were
centroids of circles.

To evaluate how well the CCP-list works at modelling 
users’ entire passwords (rather than just a subset of click- 
points within a password), we used the top ranked one- 
third of the CCP-list values (i.e., the top 138 points for 
each image) to build a graphical dictionary and carry out 
a dictionary attack against the observed passwords from 
both user studies (i.e., on all 17 images in the lab study, 
and the cars and pool images again in the field study). 
We found that for some images, this 35-bit dictionary 
was able to guess a large number of user passwords (30% 
for the icons image and 29% for the philadelphia map 
image). For both short and long-term studies, our tool 
guessed 9.1% of passwords for the cars image. A 28-bit
computer-generated dictionary (built from the top 51
ranked CCP-list points) correctly guessed 8 passwords
(22%) from the icons image and 6 passwords (17%) from 
the philadelphia image. Results of this automated graph- 
ical dictionary attack are summarized in Table 6. 




































































Image               passwords guessed    passwords guessed
                    (lab study)          (field study)
 1. paperclips       2/36 (5.5%)         -
 2. cdcovers         2/35 (5.7%)         -
 3. philadelphia    10/35 (28.6%)        -
 4. toys             2/39 (5.1%)         -
 5. bee              1/40 (2.5%)         -
 6. faces            0/32 (0.0%)         -
 7. citymap-nl       1/34 (2.9%)         -
 8. icons           11/37 (29.7%)        -
 9. smarties         5/37 (13.5%)        -
10. cars             3/33 (9.1%)         10/109 (9.1%)
11. pcb              3/36 (8.3%)         -
12. citymap-gr       0/34 (0.0%)         -
13. pool             1/35 (2.9%)         2/114 (0.9%)
14. mural            1/36 (2.8%)         -
15. corinthian       3/35 (8.6%)         -
16. truck            1/35 (2.9%)         -
17. tea              2/38 (5.3%)         -

Table 6: Passwords correctly guessed (using a 35-bit dictionary
based on a CCP-list). The number of target passwords is
different for most images (32 to 40 for the lab study).


Figure 6 shows that the CCP-list does a good job of
modelling observed user choices for some images, but
not all images. This implies that on some images, an attacker
performing an automated attack is likely to be able
to significantly cut down his search space. This method
also seems to perform well on the images for which the
visual attention model made more definite decisions: the
saliency map shows a smaller number of areas standing
out, as indicated visually by a generally darker saliency
map with a few high-intensity (white) areas. An attacker
interested in any one of a set of accounts could go after
accounts using a background image that the visual attention
model performed well on.

Figure 5: Illustration of our method of creating a CCP-list (best viewed electronically).
Panels: (a) Original image [46]. (b) Saliency map. (c) Corner detection output.
(d) Cornered saliency map (CSM) after top 51 CCP-list points have been inhibited.


In essence, this method achieves a reduction (by leaving
out some “unlikely” points) from a 43-bit full password
space to a 35-bit dictionary. The 43-bit full password
space is the proper base for comparison here, since
an actual attacker with no a priori knowledge must consider
all T-regions in an image. However, we believe
this model of candidate click-points could be improved
in a few ways. Where the model performed poorly, the cause
appeared to be a failure to create a useful visual attention
saliency map. The saliency maps seem to fail when there are
no areas that stand out from their surroundings in the
channels used in saliency map construction (color, intensity,
and orientation). Further, centroids of objects that “stand
out” to a user will not be included in this model (as only
corners are included); adding object centroids to the bitmask
is thus an avenue for improvement.
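
The dictionary sizes quoted in this section can be checked with a short calculation. The sketch below rests on an assumption that is not stated explicitly here: a password is an ordered sequence of 5 distinct click-points drawn from an alphabet of m points, so a dictionary holds m!/(m-5)! entries. Under that assumption the results are consistent with the 43-bit (m = 414), 35-bit (m = 138), and 28-bit (m = 51) figures above, and with the bitsize column of Table 4 (m = 100 and m = 50). The function name dictionary_bits is hypothetical.

```python
import math

CLICKS_PER_PASSWORD = 5  # assumption: five click-points per password


def dictionary_bits(m, clicks=CLICKS_PER_PASSWORD):
    """log2 of the number of ordered selections of `clicks` distinct points from m."""
    return math.log2(math.perm(m, clicks))


# Alphabet sizes mentioned in the text and in Table 4.
for m in (414, 138, 51, 100, 50):
    print(f"m = {m:3d}: {dictionary_bits(m):.1f} bits")
```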


6 Related Work 


In the absence of password rules, practical text password 
security is understood to be weak due to common pat- 
terns in user choice. In a dated but still often cited study, 
Klein [21] determined that a dictionary of 3 million words
(less than 1 billionth of the entire 8-character password
space) correctly guessed over 25% of passwords. Automated
password cracking tools and dictionaries that exploit
common patterns in user choice include Crack [28]
and John the Ripper [30]. More recently, Kuo et al. [23] 
found John the Ripper’s English dictionary of 1.2 mil- 
lion words correctly guessed 6% of user passwords, and 
an additional 5% by also including simple permutations. 
In response to this well-known threat, methods to cre- 
ate less predictable passwords have emerged. Yan [48] 
explores the use of passphrases to avoid password dictionary
attacks. Jeyaraman et al. [20] suggest basing a
passphrase upon an automated newspaper headline. In 
theory, creating passwords using these techniques should 
leave passwords less vulnerable to automated password 
cracking dictionaries and tools, although Kuo et al. [23] 
show this may not be the case. Proactive password 
checking techniques (e.g., [38, 7, 2]) are commonly used 
to help prevent users from choosing weak passwords. 

Many variations of graphical passwords are discussed 
in surveys by Suo et al. [39] and Monrose et al. [27]. We 
discuss two general categories of graphical passwords: 
recognition-based and recall-based. In the interest of 
brevity, we focus on the areas closest to our work: click- 
based graphical passwords, and practical security analy- 
ses of user authentication methods. 

Typical recognition-based graphical passwords re- 
quire the user to recognize a set (or subset) of K pre- 
viously memorized images. For example, the user is pre- 
sented a set of N (> K) images from which they must 
distinguish a subset of their K images. The user may 
be presented many panels of images before providing 
enough information to log in. Examples are Déjà Vu [10],
which uses random art images created by hash visualization
[32]; Passfaces [35], whereby the set of images are
all human faces; and Story [8], whereby the images are
from various photo categories (e.g., everyday objects,
locations, food, and people), with users encouraged to create
a story as a mnemonic strategy. In the cognitive
authentication scheme of Weinshall [44], a user computes
a path through a grid of images based on the locations of
those from K. The end of the path provides a number for
the user to type, which was thought to protect the values
of K from observers; Golle et al. [14] show otherwise.

Recall-based schemes can be further described as cued 
or uncued. An uncued scheme does not provide the 
user any information from which to create their graphical 
password; e.g., DAS (Draw-A-Secret) [19] asks users to 
draw a password on a background grid. Cued schemes 
show the user something that they can base their graphi- 
cal password upon. A click-based password using a sin- 
gle background image is an example of a cued graphical 
password scheme where the user password is a sequence 
of clicks on a background image. Blonder [4] originally 
proposed the idea of a graphical password with a click- 
based scheme where the password is one or more clicks 
on predefined image regions. In the Picture Password 
variation by Jansen et al. [18], the entire image is overlaid
by a visible grid; the user must click on the same
grid squares on each login. 

Birget et al. [3] allow clicking anywhere on an image
with no visible grid, tolerating error through “robust
discretization”. Wiedenbeck et al. [45, 46, 47] implement
this method as PassPoints, and study its usability including:
memorability, general perception, error rates, the effect
of allowed error tolerance, the effect of image choice
on usability, and login and creation times. They report
the usability of PassPoints to be comparable to text passwords
in most respects; the notable exception is a longer
time for successful login. The implementation we study
herein is also reported to have acceptable success rates,
accuracy, and entry times [6].

Regarding explorations of the effect of user choice, 
Davis et al. [8] examine this in a variation of Passfaces 
and Story (see above), two recognition-based schemes 
which essentially involve choosing an image from one or 
more panels of many different images. Their user study 
found very strong patterns in user choice, e.g., the ten- 
dency to select images of attractive people, and those 
of the same racial background. The high-level idea of 
finding and exploiting patterns in user choice also mo- 
tivated our current work, although these earlier results 
do not appear directly extendable to (cued recall) click- 
based schemes that select unrestricted areas from a sin- 
gle background image. Thorpe et al. [41, 42] discussed 
likely patterns in user choice for DAS (mirror symme- 
try and small stroke count), later corroborated through 
Tao’s user study [40]. These results also do not appear to 
directly extend to our present work, aside from the com- 
mon general idea of attack dictionaries. 

Lopresti et al. [24] introduce the concept of generative 
attacks to behavioral biometrics. Ballard et al. [1] gen- 
erate and successfully apply a generative handwriting- 
recognition attack based on population statistics of hand- 
writing, collected from a random sample of 15 users with 
the same writing style. In arguably the most realistic 
study to date of the threats faced by behavioral biomet- 
rics, they found their generative attacks to be more ef- 
fective than attacks by skilled and motivated forgers [1]. 
Our most successful attack from Section 4.2 may also 
be viewed as generative in nature; it uses click-points 
harvested from a small population of users from another 
context (the lab study), performs some additional pro- 
cessing (clustering), and recombines subsets of them as 
guesses. Our work differs in its application (click-based 
graphical passwords), and in the required processing to 
generate a login attempt. 


7 Discussion and Concluding Remarks 


Our results demonstrate that harvesting data from a 
small number of human users allows quite effective of- 
fline guessing attacks against click-based graphical pass- 
words. This makes individual users vulnerable to tar- 
geted (spear) attacks, as one should assume that an at- 
tacker could find out the background image associated 
with a target victim, and easily gather a small set of 
human-generated data for that image by any number of 
means. For instance, an attacker could collect points by
protecting an attractive web service or contest site with 
a graphical password. Alternatively, an attacker could 
pay a small group of people or use friends. This at least 
partially defeats the hope of improving one’s security in a
click-based scheme through a customized image.

We found that our human-seeded attack strategy was 
quite successful, guessing 36% of passwords with a 31- 
bit dictionary in one instance, and 20% of passwords 
with a 33-bit dictionary in another. Preliminary work 
shows that click-order patterns can be used to further 
reduce the size of these dictionaries, while maintaining 
similar success rates. The success of our human-seeded 
attack dictionaries appears to be related to the amount of 
hot-spotting on an image. The prevalence and impact of
hot-spots contrast with earlier views that underplayed their
potential impact, and with suggestions [47] that any highly
detailed image may be a good candidate. Our studies
allow us to update previous assumptions that half of all 
click-regions on an image will be chosen by users. Af- 
ter collecting 570 and 545 points, we only observed 111 
and 133 click-regions (for pool and cars respectively); 
thus, one quarter to one third of all click-regions would 
be a more reasonable estimate even from highly detailed 
images, and the relative probabilities of these regions 
should be expected to vary quite considerably. 

Our purely automated attack using a combination of 
image processing measures (which likely can be consid- 
erably improved) already gives cause for concern. For 
images on which Itti et al.’s [17] visual attention model 
worked well, our model appeared to do a reasonable job 
of predicting user choice. For example, an automatically- 
generated 28-bit dictionary from our tools guessed 8 out 
of 37 (22%) observed passwords for the icons image, and 
6 out of 35 (17%) for the philadelphia image. Our tools 
guessed 9.1% of passwords for the cars image in both 
the short-term lab and long-term field studies. Improve- 
ments to pursue include adding object centroids to the 
bitmask used in creating the cornered saliency map. 

Our attack strategies (naturally) could be used defen- 
sively, as part of proactive password checking [38, 7, 2]. 
Thus, an interesting avenue for future work would be to 
determine whether graphical password users create other 
predictable patterns when their choices are disallowed 
by proactive checking. Additionally, the visual attention 
model may be used proactively to determine background 
images to avoid, as those images on which the visual at- 
tention model performed well (e.g., identifies some areas 
as much more interesting than others) appear more vul- 
nerable to the purely automated attacks from Section 5. 

An interesting remaining question is whether altering
parameters (e.g., pixel sizes of images, tolerance settings,
number of click-points) in an attempt to improve security
can result in a system with acceptable security and
usability simultaneously. Any proposal with significantly
varied parameters would require new user studies exploring
hot-spotting and usability.

Overall, the degree of hot-spotting confirmed by our 
studies, and the successes of the various attack strate- 
gies herein, call into question the viability of click-based 
schemes like PassPoints in environments where off-line 
attacks are possible. Indeed in such environments, a 
43-bit full password space is clearly insufficient to start 
with, so one would assume some tolerable level of pass- 
word stretching (e.g., [15, 34]) would be implemented 
to increase the difficulty of attack. Regardless of these 
implementation details, click-based graphical password 
schemes may still be a suitable alternative for systems 
where offline attacks are not possible, e.g., systems
currently using PINs.

Notes


1 Version: May 13, 2007. A preliminary version of this paper
was available as a Technical Report [43].

2 Analysis showed little difference between the points chosen
for these different tolerance groups.

3 A preliminary version [43] had a small technical error
causing some numbers to be less than shown herein in Tables 3
and 5.

4 As harris(image, 1, 1000, 3)

Abstract


We revisit the venerable question of “pure password”-based
key derivation and encryption, and expose security
weaknesses in current implementations that stem from 
structural flaws in Key Derivation Functions (KDF). We 
advocate a fresh redesign, named Halting KDF (HKDF), 
which we thoroughly motivate on these grounds: 

1. By letting password owners choose the hash itera- 
tion count, we gain operational flexibility and eliminate 
the rapid obsolescence faced by many existing schemes. 

2. By throwing a Halting-Problem wrench in the 
works of guessing that iteration count, we widen the se- 
curity gap with any attacker to its theoretical optimum. 

3. By parallelizing the key derivation, we let legiti- 
mate users exploit all the computational power they can 
muster, which in turn further raises the bar for attackers. 

HKDFs are practical and universal: they work with 
any password, any hardware, and a minor change to 
the user interface. As a demonstration, we offer real- 
world implementations for the TrueCrypt and GnuPG 
packages, and discuss their security benefits in concrete 
terms. 


1 Introduction 


For a variety of reasons, it is becoming increasingly de- 
sirable for people leading an electronic lifestyle to attend 
to a last bastion of privacy: a stronghold defended by 
secret-key cryptography, and whose key exists only in 
its guardian’s mind. To this end, we study how “pure
password”-based encryption can best withstand the most
dedicated offline dictionary attacks, regardless of
password strength.


1.1 Human-memorable Secrets


Passwords. Passwords in computer security are the
purest form of secrets that can be kept in human memory,
independently of applications and infrastructures. They
can be typed quickly and discreetly on a variety of de- 
vices, and remain effective in constrained environments 
with basic input and no output capabilities. Not sur- 
prisingly, passwords and passphrases have become the 
method of choice for human authentication and mental 
secret safekeeping, whether locally or remotely, in an on- 
line or offline setting. 

Passwords have the added benefit of working on diminutive
portable keypads that never leave the user’s control,
guaranteeing that the secret will not be intercepted
by a compromised terminal. User-owned and password- 
activated commercial devices include the DigiPass [12] 
for authorizing bank transactions, the CryptoCard [10] 
for generating access tokens, and the ubiquitous cellular 
phone which can be used for making payments via SMS 
over the GSM network. 

Nevertheless, the widespread use of passwords for se- 
curing computer systems is often deplored by system ad- 
ministrators, due to their low entropy and a propensity to 
being forgotten unless written down, which in turn leads 
to onerous policies that users deem too difficult to follow
[43]. In this work, by contrast, we seek not to change
people’s habits in significant ways; rather, our goal is to 
maximize security for passwords that are actually used, 
no matter how weak these might be. 


Alternatives. A number of alternatives have been suggested
to alleviate the limitations of passwords, including
inkblots [39], visual recognition [29], client-side
puzzles [21, 11], interactive challenges [32], and word
labyrinths [6], but none of them has yet gained much
traction.

Multi-factor authentication systems seek not to replace 
passwords, but supplement them with a second or third 
form of authentication, which could be a physical token 
(e.g., SecurID [38]) or a biometric reading. These ap- 
proaches are mostly effective in large organizations. 

Compelling as these sophisticated proposals may be, 
multi-factor authentication is no panacea, and the various
mental alternatives to passwords tend to be slow,
complex, and error-prone, and depend on a particular 
medium or infrastructure. For instance, mental puzzles 
typically require multiple rounds of interaction to gather 
enough entropy, and image recognition tasks will never 
work without a display. Simple portable keypads are 
pretty much out of the question. The usual criticisms 
that have been levelled at passwords, such as low entropy 
and poor cognitive retention, apply to these alternatives 
as well. 


1.2 Application Contexts


Online Uses. In the online setting, the main use of 
passwords is for remote user authentication. Password- 
based Encrypted Key Exchange (EKE) [5] and Authenti- 
cated Key Exchange (PAKE) [17] protocols enable high- 
entropy session keys to be established between two or 
more parties that hold a low-entropy shared secret, ide- 
ally with mutual authentication. The threat model is the 
online attack, conducted by an opponent who can ob- 
serve and corrupt the lines of communication, and some- 
times also the transient state of a subset of the partici- 
pants, but without access to the long-term storage where 
the password data are kept. 

What makes the online setting favorable for password-based
authentication is that participants can detect (in
zero knowledge) when an incorrect password is used, and 
terminate the protocol without leaking information. The 
attacker can always run a fresh instance of the protocol 
for every candidate password, but many EKE and PAKE 
protocols [20, 4, 8] achieve theoretically optimal secu- 
rity by ensuring that no adversary can do better than this. 
Online guessing is easy to detect in practice, and can be 
defeated by locking out accounts with repeated failures. 
Dealing with passwords in the pure online setting is in 
that respect a mostly solved problem, and is the topic 
of the ongoing IEEE 1363.2 standardization effort [19]. 
We will not discuss online passwords further. 


Offline Uses. In the offline setting, passwords are 
mainly used for login and to encrypt data at rest in lo- 
cal storage. Typical applications of password-based en- 
cryption range from user-level encryption of PGP or 
S/MIME private keys, to kernel-level enforcement of ac- 
cess permissions, to hardware-level encryption of a lap- 
top’s hard disk by a security chip or by the drive itself. 
Despite their limitations, passwords tend to be prefer- 
able to other types of credentials. Physical tokens able 
to store large cryptographic keys are susceptible to theft 
along with the laptop they are supposed to protect. Bio- 
metrics are inherently noisy and must trade security for 
reliability; they are also tied to a specific user and cannot 


be revoked. Visual and other alternatives to passwords 
are often complex and too demanding for low-level op- 
eration or in embedded systems; at any rate they do not 
have clear security benefits over passwords. 

The main threat faced by password-based encryption 
is the offline dictionary attack. Unlike the online guess- 
ing discussed earlier, in an offline attack the adversary 
has access to the complete ciphertext and all relevant 
information kept in storage—except the password—and 
does not need the cooperation of remote parties to carry 
out the attack. Tamper-resistant hardware may compli- 
cate ciphertext acquisition, but, past that point, the adver- 
sary is bound only by sheer computational power: this is 
what makes low-entropy passwords so much more dam- 
aging offline than online. 


1.3 Password-based Encryption


Aside from the peril of dictionary attacks, passwords 
are not usable natively as encryption keys, because they 
are not properly distributed. Key Derivation Functions 
(KDF) let us solve this. 


Key Derivation. The goal is to create a uniform and 
reproducible key from a password. The universally ac- 
cepted practice is to mangle the password through a hash 
function a number of times, after blending it with ran- 
dom data called salt that is made public. The many hash 
iterations serve to make offline dictionary attacks slower, 
and the salt is to preclude using lookup tables as a short- 
cut [18, 30, 3]. Virtually all KDFs follow this model; 
however, it is not a panacea. 
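
As a concrete illustration of the salted, iterated model just described, the following Python sketch derives a key with PBKDF2 from the standard library. It is a generic example rather than the derivation used by any package discussed here; the hash choice (SHA-256), salt length, iteration count, and the helper name derive_key are assumptions made for illustration.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int, key_len: int = 32) -> bytes:
    """Salted, iterated key derivation (PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=key_len
    )


salt = os.urandom(16)  # public random salt, stored alongside the ciphertext
key = derive_key("correct horse", salt, iterations=1000)

# The key is reproducible from the same password, salt, and iteration count.
assert key == derive_key("correct horse", salt, iterations=1000)
```

Note that the iteration count is a fixed, public parameter here, which is exactly the rigidity criticized in the remainder of this section.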

For one, referring to the apparent futility of preventing
(targeted) dictionary attacks, in the full version of
their recent CRYPTO ’06 paper, Canetti, Halevi, and 
Steiner [9] lament: 


[...] typical applications use a key-derivation- 
function such as SHA1 repeated a few thou- 
sand times to derive the key from the password, 
in the hope of slowing down off-line dictionary 
attacks. [...] Although helpful, this approach is 
limited, as it entails an eternal cat-and-mouse 
chase where the number of iterations of SHA1 
continuously increases to match the increasing 
computing powers of potential attackers. 


Instead, these authors propose to treat the password as 
a path in a maze of CAPTCHAs [42], whose (secret)
answers will provide the key. Alas, such augmented- 
password schemes tend to be unwieldy; here, gigabytes 
of CAPTCHAs must be pre-generated, and then retrieved 
in secret, which relegates it to local storage (lest an of- 
fline dictionary attack on the access pattern reveal the 
password). 

In general, while it is true that secrets with visual or
interactive components are likely to hamper mechani- 
cal enumeration, old-fashioned passwords will remain 
faster, less conspicuous, and much more convenient for 
humans to handle and recall. Still, the problem remains 
to design a good KDF. 


Iteration Count. To perceive the difficulty of KDF 
design, recall that Unix’ crypt() hashing for 
/etc/passwd back in the seventies took a quarter of a 
second [33] to perform two dozen iterations of the DES 
cipher (with salt). The original PKCS#5 key derivation 
standard from the early nineties [37] was content to use 
a “positive number” of applications of MD2 or MD5, 
but has since been updated [22] to recommend “at least 
1000” iterations of MD5 or SHA1. This recommenda- 
tion has been followed in the recent and well-regarded 
TrueCrypt software [40], albeit perhaps on the edge, 
with merely 2000 iterations of SHA1 or RIPEMD160,
or 1000 iterations of WHIRLPOOL. Unfortunately, these 
numbers are set in stone in the TrueCrypt source code. 

The custom “s2k” (string-to-key) function of 
GnuPG [15] is preset to hash a total of 65536 bytes 
based on the password, which amounts to a few thou- 
sand iterations of SHA1. Sadly, this number is once 
again hardcoded without user override. At least, the 
OPENPGP [7] format offers some flexibility in that 
regard, and GnuPG can be recompiled to hash up 
to a maximum of 65011712 bytes, without breaking 
compatibility with the official version. Still, even that 
ostensibly large number appears pathetic by today’s 
standards, as it takes only two seconds to digest those 65 
million bytes on a 1.5 GHz laptop circa 2005. 
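
The two byte counts above correspond to values of the one-octet coded count used by the iterated and salted string-to-key specifier in the OpenPGP format. The sketch below decodes that octet using the formula given in the OpenPGP specification (RFC 4880, section 3.7.1.3); the function name s2k_octet_count is hypothetical, and the example only illustrates the encoding, not GnuPG's source code.

```python
def s2k_octet_count(c: int) -> int:
    """Decode the one-octet coded count of an iterated and salted S2K specifier."""
    if not 0 <= c <= 255:
        raise ValueError("coded count must fit in one octet")
    # Low 4 bits are a mantissa (offset by 16), high 4 bits an exponent (biased by 6).
    return (16 + (c & 15)) << ((c >> 4) + 6)


print(s2k_octet_count(96))   # coded value that decodes to 65536 bytes
print(s2k_octet_count(255))  # largest coded value: 65011712 bytes
```

Because the count is a single octet, the format caps the amount of hashing at the 65011712 bytes mentioned above, whatever the user might prefer.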


1.4 The Problem, and Our Solution 


The balancing act in KDF design is to choose a large 
enough iteration count to frustrate a dictionary attack, 
but not so large as to inconvenience the user. Any choice 
made today is likely to prove wholly inadequate a few 
years from now. Furthermore, this assessment should be 
made in view of the lifespan and sensitivity of the plain- 
text, as well as the estimated strength of the password— 
two crucial tidbits of which only the actual user (and not 
the system designer) is privy. 


Security Maximization and User Programmability. 
Given the constraints, the primary goal is to maximize— 
by technical means—the “gap” between user inconve- 
nience and the costs inflicted on attackers. Secondarily, 
it is crucial—for policy and deeper reasons—that users 
be free to vary the (secret) level of inconvenience they 
are willing to accept on a case-by-case basis. In essence, 
we: 


(i) let the user choose the amount of work he 
or she deems appropriate for the task, 

(ii) keep the choice secret from attackers (and 
allow the user to forget it too), 

(iii) and ensure that all user-side computing 
power can be exploited. 


We emphasize again that human-selected passwords tend 
to be by far the weakest link in a typical cryptographic 
chain [26], which is why we seek to squeeze as much 
security from them as we can. 


“Halting” Key Derivation Functions. HKDFs are the 
practical embodiment of all the above requirements. 
They consist of two algorithms, Prepare and Extract. 
The principle is as follows: 


• To create a random encryption key, the user
launches a randomized algorithm HKDF.Prepare
on the password, lets it crunch for a while, and
interrupts it manually using the user interface, to obtain
an encryption key along with some public string to
be stored with the ciphertext.


• To recover the same key subsequently, the user
applies a deterministic algorithm HKDF.Extract on
the password and the public string from the first
phase. The algorithm halts spontaneously when it
recognizes that it has recovered the correct key,
barring which it can be reset manually.


Thus, if the user entered the correct password, 
HKDF.Extract will halt and output the correct key af- 
ter roughly the same amount of time as the user had let 
HKDF.Prepare run in the setup phase. However, if the 
user entered a wrong password, at some point he or she 
will find that it is taking too long and will have the option 
to stop the process manually in order to try again. 

Notice that the public string causes the derived key to 
be a randomized function of the password, and thus also 
plays the role of “salt”. HKDFs can be used as drop-in 
substitutes for regular KDFs, pending addition in the user 
interface of a button for interrupting the computation in 
progress. 


HKDF Ramifications. The above idea is as simple as 
it is powerful, though surprisingly it has not been inves- 
tigated or implemented before. Ramifications are deep, 
however: 


1. (Stronger crypto) Two extra bits of security can be 
reclaimed from any password. 


A paradoxical result that we prove in this paper
is that, if the attacker does not know the iteration
count, and is then compelled to use a “dovetail”
search strategy with many restarts, then the attack
effort is multiplied by ~ 4x (a 2-bit security gain),
at no cost to the user.


Intuitively, our design will force any game-theoretic 
optimal brute-force attacker to overshoot the true it- 
eration count when trying out wrong passwords. By 
contrast, when the user enters the correct password, 
the key derivation process will be halted as soon 
as the programmed number of iterations is reached 
(using some mechanism for detecting that this is the 
case). 


2. (Flexible policies) Long-term memorable passwords
for key recovery become a possibility.


Sophisticated users should be able to choose any 
password that they will remember in the long term, 
even with low entropy, as long as they are used with 
a large enough iteration count to keep brute-force 
attackers at bay (at the cost of slowing down legiti- 
mate uses correspondingly). 


This opens the possibility of using multiple pass- 
words of reciprocal strength and memorability: one 
high-entropy password with a small iteration count 
for fast everyday use; and a second, much more 
memorable password for the long term, protected by 
a very large iteration count, to be used as a backup 
if the primary password is forgotten. 


3. (Future proofing) Password holders automatically
keep pace with password crackers.


Indeed, if every time a user’s password is changed, 
the iteration count is selected to take some given 
amount of time on the user’s machine, then the it- 
eration count will automatically increase with any 
hardware speed improvement. This will negate all 
advantage that a brute-force attacker might gain 
from computers becoming faster, if we make the 
natural assumption that technological progress ben- 
efits password verifiers at the same rate as password 
crackers. 


4. (Resource maximization) User-side parallelism is
exploited to raise the cost of attacks.


Users care about (real) elapsed time; attackers about 
cumulative CPU time. Independently of the idea of 
hiding the iteration count, we design the key deriva- 
tion to be parallelizable even for a single key. With 
the popularization of multi-core PCs, users will then 
be able to increase the total cost of key derivation 
without increasing the observed elapsed time that 
matters to them. The heightened total cost is how- 
ever borne in full by the adversary, who gains noth- 
ing by parallelizing “within” single passwords, as 
opposed to “across” several ones. 


These benefits are complementary rather than independent:
for example, by accentuating the iteration unpredictability,
Properties 2 and 3 solidify the Property 1 security
gains that ride on it. Property 4 is orthogonal, but
is equally crucial to our goal of making attacks
maximally expensive.


User Acceptance. Aside from the technical arguments
we develop in the remainder of this paper, there remains
the question of user acceptance. Although we cannot answer
this question in the name of others, it seems reasonable
to assume that acceptance should be easy.

The general principle of using deliberately expensive 
cryptography in conjunction with passwords has become 
standard, and is expected by users. The main commercial 
operating systems even use login screens that frustrate 
casual password guessing “by hand” using fake delays. 
Although this theatre provides but illusory protection 
against true offline attacks, it eloquently demonstrates 
that users (or system provisioners) demand that penal- 
ties be assessed for entering bad passwords. HKDFs ful- 
fil these expectations in a cryptographically sound way, 
but in stark contrast to those commercial approaches, 
HKDFs seek to empower users without burdening them, 
for their benefit. 


1.5 Related Work 


The first deliberate use of expensive cryptographic
operations to slow down brute-force attacks, in the crypt()
password hashing function on Unix systems, coincides
with the public availability of the DES cipher. Since
then, a lot of progress has been made.

Provos and Mazières [33] have proposed a
cost-parameterizable alternative to Unix crypt(), called
bcrypt(), to avoid the obsolescence problems associated
with fixed iteration counts. In their proposal, the
cost parameter is set by the system administrator, shared
among users, and must be committed to storage (rather
than kept secret, set arbitrarily, and easy to program
using the user interface, as in the present work). More
recently, Halderman et al. [16] proposed the idea of making
key derivation very slow the first time, and subsequently
faster by caching some state on the user machine:
this is mostly useful for client-server authentication
when the password is so weak that online trial-and-error
is the greater concern, seconded by cache exposure.
Interestingly, online PAKE protocols [27] have recently
started to take offline dictionary attacks into consideration,
by avoiding keeping user passwords in the clear
on the server, and by distributing these servers among
several locations. Other approaches to password management
seek to prevent dictionary attacks in specific
contexts: the PwdHash [36] system is a browser plug-in
that generates reproducible unique passwords for different
web sites, and offers some resistance to both online
and offline attacks.

Deliberately expensive cryptography has also been 
applied in “proof-of-work” schemes for combatting 
junk email [14] as well as for carrying out micro- 
payments [2], among other similar applications. These 
CPU-bound constructions are based on easy-to-verify 
but hard(er)-to-compute answers to random challenges 
built from hash functions. Memory-bound proof-of- 
work schemes have also been proposed [13], motivated 
not by the desire to prevent parallelism, but rather by the 
observation that memory chips have narrower and more 
predictable speed ranges than CPUs. At the other ex- 
treme of this spectrum, time-lock puzzles [35] are en- 
cryption schemes designed to be decryptable, without a 
key, after a well-defined but very long computation; these 
schemes are based on algebraic techniques, and view the 
publicity of the decryption delay as a feature [28]. 

Regarding parallel hashing schemes, we mention 
Split MAC [41], which is a parallelizable version of 
HMAC [24] for hashing long messages (rather than a 
long loop from a short password). On the cryptana- 
lytic side, we mention Hellman’s [18] classic time/space 
trade-off attack against deterministic password hashing, 
and its modern reincarnation as Oechslin’s [30] rainbow
tables. See also [3] for a theoretical study of these types 
of algorithms. 


Contribution. The point of this paper is as much to 
study HKDFs for their own sake as a new cryptographic 
and security tool, as it is to advocate their deployment in 
all practical systems that do password-based encryption. 

In Section 2 we define HKDFs, construct them gener- 
ically, and prove their basic security. We also parame- 
terize them for the long term, and discuss user-side par- 
allelism. In Section 3 we adopt a theoretical stance and 
study the origin of the ~ 4x security factor that seems 
to arise magically. 

In Section 4 we put on a systems hat and show how to 
integrate HKDFs in popular software such as TrueCrypt 
and GnuPG. We plan to release our implementations as 
open-source C code. 


2 HKDF Design


The guiding design principles of Halting Key Derivation 
Functions are the following: 


1. the cost of key derivation is programmed by the user 
and has no prior upper bound; 

2. the amount of work for each key is independent and 
secret; 


3. the key derivation memory footprint grows in lock- 
step with computation time; 

4. the computation for deriving a single key can be dis- 
tributed if needed. 


We have already mentioned the motivation for (1.) letting
the user program the iteration count t arbitrarily, and (4.)
providing user-side parallelism. The justification for (2.)
keeping t a secret, and (3.) having the memory footprint
grow linearly with t, is to force the attacker to make
costly guesses as it tries out wrong candidate passwords
from its dictionary D.

Suppose the adversary is certain the true password w
belongs in D, but has no idea about t. The obvious
approach is to try out all the words in D, in parallel, for
as many iterations as needed. However, this attack is
incredibly memory-consuming since for each word there
is state to be kept: terabytes or more for mere 40-bit
entropy passwords ($\#D = 2^{40}$).

If the attacker cannot maintain state across all of D
as the iteration count is increased, the only alternative
is to fix an upper bound $\hat{t}$ for t and try each word for
$\hat{t}$ iterations, and then start over with a bigger $\hat{t}$.
Clearly, this is more expensive since much of the computation
is being redone. How much more expensive depends on the
schedule for increasing $\hat{t}$. Increase it too slowly, e.g.,
$\hat{t} = 1, 2, 3, \ldots$, and most of the work ends up being
redone. Increase it too fast, e.g., $\hat{t} = 1!, 2!, 3!, \ldots$,
and the true value of t risks being overshot by a wide margin.
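To make the trade-off between redone work and overshoot concrete, the following sketch counts the iterations a restarting attacker spends on a single wrong candidate word before its bound first reaches the true count t. The helper names (`restart_cost`, `linear`, `doubling`, `factorials`) and the value t = 1000 are illustrative assumptions, not part of the paper.

```python
from itertools import count
from math import factorial


def restart_cost(t, schedule):
    """Return (total, last) for one wrong word under a restart schedule.

    total: iterations summed over every pass up to the first bound >= t.
    last:  that final bound, i.e. how far the true t was overshot.
    """
    total = 0
    for bound in schedule:
        total += bound
        if bound >= t:
            return total, bound
    raise ValueError("schedule ended before reaching t")


def linear():
    return count(1)                          # bounds 1, 2, 3, ...


def doubling():
    return (2 ** i for i in count(0))        # bounds 1, 2, 4, ...


def factorials():
    return (factorial(i) for i in count(1))  # bounds 1!, 2!, 3!, ...


if __name__ == "__main__":
    t = 1000
    for name, make in (("linear", linear), ("doubling", doubling),
                       ("factorial", factorials)):
        total, last = restart_cost(t, make())
        print(f"{name:9s} total={total:7d} last bound={last}")
```

The linear schedule pays mostly for repeated passes, the factorial schedule mostly for its final overshoot, and a geometric schedule sits between the two.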

We shall see that with the optimal strategy the attacker
can keep the cost as low as ~ 4x as much as if t had
been public. The user does not pay this penalty since
on the correct password the HKDF halts spontaneously
at the correct iteration count t (which the user need not
recall either). This gives us ~ 2 bits of extra security
essentially for free.

The memory footprint growth in O(t) is a technicality 
to ensure that the argument holds for arbitrarily large t, 
lest it become more economical beyond some threshold 
to purchase the memory. 


2.1 Formal Specification


As briefly outlined in Section 1.4, an HKDF consists of 
a pair of deterministic functions: 


Prepare : $(w, r, t) \mapsto v$ which, given a password w, a
random string r, and an iteration count t, produces
a public verification string v;


Extract : $(w, v) \mapsto k$ which, given a password w and
a verification string v, outputs a key k upon halting,
or fails to halt in polynomial time.


In this abstract model, the iteration count parameter t is
given to Prepare at the outset. In practice, the user sets t
implicitly by interrupting the computation as she pleases,
using the user interface.


Security Model. We write $[a]$ and $[a \mid b]$ to denote
marginal and conditional distributions of random variables.
Let $U_S$ denote the uniform distribution over a set
S, often implicit from context.

Pick $r \in_{\$} \{0,1\}^{\ell}$, viz., so that $[r] = U_{\{0,1\}^{\ell}}$, to be
our ℓ-bit random seed for some parameter ℓ. We first
demand that the extracted keys be uniform and statistically
independent of the secrets:


- Key uniformity: $[k \mid w, t] = U$ where
k = Extract(w, Prepare(w, r, t)).


We also impose lower and upper computational
complexity bounds on the functions:


- Preparation complexity: Prepare(w, r, t) always
halts in time O(t), for all inputs.


- Extraction complexity: Extract(w, v) requires time
and space $\Omega(t)$, for all v = Prepare(w, r, t).


- Conditional halting: Extract(w′, v) does not halt in
polynomial time when w′ ≠ w.


We then ask that the key be unknowable without the
requisite effort, even with all the data:

- Bounded indistinguishability: $[v, k \mid w, t] \overset{o(t)}{\approx} U$ for
v = Prepare(w, r, t) and k = Extract(w, v).
I.e., for any randomized algorithm running in space
(and hence time) strictly sub-linear in t, the joint
[v, k] is computationally indistinguishable from
random even given w and/or t.


As a consequence of the latter, the public string v is
computationally indistinguishable from random to anyone
who has not also guessed (and tested for t iterations)
the correct password against it.

To summarize, for random r, it must be infeasible
to find, in polynomial time in the security
parameter, a tuple (k, t, w, w′) such that
k = Extract(w′, Prepare(w, r, t)) and w ≠ w′. Furthermore,
finding a tuple (k, t, w) such that
k = Extract(w, Prepare(w, r, t)) must require O(t) units of
time and memory, barring which no information about
the correct k must be obtained from w, r, t.


2.2 Generic Construction 


There are many ways to realize HKDFs, depending on
the computational assumptions we make. One of the
simplest constructions is generic and is based on some
cryptographic hash function $H : \{0,1\}^{*} \to \{0,1\}^{\ell}$ viewed
as a random oracle, for a security parameter ℓ.


To capture the main idea, we start with a sequential 
HKDF construction. The construction is: 


HKDF_H.Prepare(w, r, t)

Inputs: password w, random string r, iteration count t
(may be implicit from user interrupt).
Output: verification string v (and corresponding key k).

1. z ← H(w, r)                        // init z from password and seed
2. FOR i := 1, ..., t or until interrupted
3.     y_i ← z                         // store z in array element y_i
4.     REPEAT q times
5.         j ← 1 + (z mod i)           // map z to some j ∈ {1, ..., i}
6.         z ← H(z, y_j)               // update z
7. v ← (H(y_1, z), r)
8. k ← H(z, r)


HKDF_H.Extract(w, v)

Inputs: password w, verification string v.
Output: derived key k, or may never halt.

0. parse v as (h, r)                  // comparison and seed strings
1. z ← H(w, r)
2. FOR i := 1, ..., ∞                 // forever loop
3.     y_i ← z
4.     REPEAT q times
5.         j ← 1 + (z mod i)
6.         z ← H(z, y_j)
7.     IF H(y_1, z) = h THEN BREAK     // break on halting condition
8. k ← H(z, r)


The constant q is a parameter that determines the ratio
between the time and space requirements. Since the
Extract function may not halt spontaneously, it must be
resettable by the user interface.
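The sequential construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: SHA-256 stands in for the random oracle H, inputs are length-prefixed before hashing (our choice of encoding), and the `max_iters` cap is a hypothetical stand-in for the user pressing a reset button.

```python
import hashlib


def H(*parts: bytes) -> bytes:
    """Hash a tuple of byte strings; SHA-256 and length prefixes are assumptions."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


def _mix(z: bytes, y: list, i: int, q: int) -> bytes:
    """Steps 4-6: q times, pick j in {1..i} from z and fold y_j into z."""
    for _ in range(q):
        j = int.from_bytes(z, "big") % i   # 0-based index of y_j
        z = H(z, y[j])
    return z


def prepare(w: bytes, r: bytes, t: int, q: int = 4):
    """Return (v, k) after t iterations; v = (h, r) is public, k is the key."""
    if t < 1:
        raise ValueError("t must be at least 1")
    z, y = H(w, r), []
    for i in range(1, t + 1):
        y.append(z)                        # y_i <- z
        z = _mix(z, y, i, q)
    return (H(y[0], z), r), H(z, r)


def extract(w: bytes, v, q: int = 4, max_iters=None):
    """Return the key once the halting condition fires, else None at the cap."""
    h, r = v
    z, y, i = H(w, r), [], 0
    while max_iters is None or i < max_iters:
        i += 1
        y.append(z)
        z = _mix(z, y, i, q)
        if H(y[0], z) == h:                # halting condition
            return H(z, r)
    return None                            # caller gave up (manual reset)


if __name__ == "__main__":
    v, k = prepare(b"correct horse", bytes(32), t=50)
    print(extract(b"correct horse", v) == k)
    print(extract(b"wrong guess", v, max_iters=500))
```

Note that Extract never receives t: it discovers the stopping point only through the comparison against h, and its list `y` grows by one entry per iteration, which is the memory growth the design calls for.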


2.3. Security Properties 


It is easy to see that the key output by Prepare is random 
and correctly reproducible by Extract. As for the HKDF 
security properties, we state the following lemmas. 


Lemma 1. Key uniformity: $[k \mid w, t] = U_{\{0,1\}^{\ell}}$ where
k = Extract(w, Prepare(w, r, t)).


Lemma 2. Preparation complexity: Prepare(w, r, t)
halts in time O(qt) on all inputs, for fixed q.


Lemma 3. Extraction complexity: Extract(w, v) halts 
in time O(qt) and uses O(t) bits of memory, for any v = 
Prepare(w, r,t) with same w. 
Proofs. Since r is random and H is a random function,
k = H(z, r) is uniformly distributed for all z, which
establishes Lemma 1. Lemmas 2 and 3 follow by inspection
of the algorithms.














Lemma 4. Conditional halting: Except with negligible
probability, Extract(w′, v) halts in super-polynomial
time ($2^{\ell} q$) for any v = Prepare(w, r, t) and w′ ≠ w,
where the probability is taken over the random choice of
H for arbitrary inputs.


Proof. For w′ ≠ w, the value of $y_1 = H(w, r)$ in
Prepare and $y'_1 = H(w', r)$ in Extract will be
statistically independent since H is a random function, and
therefore so will be the benchmark $h = H(y_1, z)$ and its
comparison value $H(y'_1, z')$ for all z′. Since the constant
h, the variable z′, and the value $H(y'_1, z')$ are all ℓ-bit
binary strings, we find that, letting ℓ → ∞,


$$\Pr(\text{Extract loops indefinitely}) = e^{-1} \approx 0.3678794,$$

$$\Pr(\text{Extract halts before count } i) = (1 - e^{-1})\,(1 - e^{-i/2^{\ell}}).$$


The probability of halting on the wrong password in
sub-exponential time $i \ll 2^{\ell}$ is negligible.














Lemma 5. Bounded indistinguishability: the distributions
$[v, k \mid w, t]$ and $U_{\{0,1\}^{3\ell}}$ are perfectly
indistinguishable by any algorithm running in sub-linear time
and/or space o(t) in the iteration count, for any
v = Prepare(w, r, t) and k = Extract(w, v).


Informal proof sketch. We deal with the time-bound
indistinguishability claim first. Observe that both v and k
are independent outputs of a chain of qt applications of
H, seeded by r. Since H is a random oracle, a standard
argument shows that no information about (v, k) can be
obtained without qt queries to H, which establishes the
time-bound indistinguishability claim.

For the stronger space-bound indistinguishability
claim, a more subtle argument shows that, with
overwhelming probability, all possible computation paths
require that “almost all” $y_i$ for $i = 1, \ldots, t$ be stored
in memory. The argument is based on the following
sequence of observations: (1) For all $i \in \{1, \ldots, t\}$
and all $i' \in \{i, \ldots, t\}$, the value $y_i$ computed at step
i will be needed at a subsequent step i′ with probability
$\Pr(y_i \text{ needed at step } i') = q/i'$, independently of
its prior uses. (2) The expected number of times that
$y_i$ will be needed in the course of the entire computation
is $\#\{i' : y_i \text{ needed at step } i'\} = \sum_{i'=i}^{t} (q/i') \approx q \ln(t/i)$,
which is $> nq$ for any $n > 0$ and $i < e^{-n} t$.
(3) The probability that for fixed $i < e^{-n} t$ the value
$y_i$ is never needed is $\Pr(y_i \text{ not needed}) < e^{-nq}$, which
whenever $n > \ell/(q \ln 2)$ is a vanishingly small function
of the effective security parameter ℓ. (4) Since, for
such n, the difference $e^{-n} t - e^{-n-1} t$ is a linear function
of t, the sub-linear memory constraint requires that
some $y_j$ with $j < e^{-n-1} t$ be dropped prior to reaching
the $\lceil e^{-n} t \rceil$-th step. (5) With overwhelming probability
$\Pr > 1 - e^{-nq}$, the dropped value $y_j$ appears in the
computation path of some $y_i$ where $j < e^{-n} t < i$, and
without the value of $y_j$ the key derivation cannot proceed.

The outcome of this reasoning is that before we can
compute $y_i$, we need to recompute the dropped value $y_j$,
which itself requires the recomputation of some earlier
values still: some of these values must also have been
dropped, as the same reasoning shows using an incremented
$n \leftarrow n + 1$ (with recursion upper bound $\lfloor \ln t \rfloor$).
To complete the argument, we note that for some l where
$j < l < i < t$, the recomputation of $y_j$ needed for $y_i$
will require freeing up some previously stored value $y_l$,
which is still needed for the calculation of $y_i$, and whose
recomputation will require $y_j$; when this happens, the
algorithm will be stuck. This shows that the intrinsic space
complexity of computing HKDF_H.Extract by whatever
means in the random oracle model is $O(\ell t)$.
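Observation (2) of the sketch rests on approximating a harmonic-type sum by a logarithm. The following numerical check is an illustration only; the function names and the sample values of i, t, q and n are our own assumptions.

```python
import math


def expected_uses(i: int, t: int, q: int) -> float:
    """Sum of q/i' for i' = i..t: expected number of later uses of y_i."""
    return sum(q / step for step in range(i, t + 1))


def log_estimate(i: int, t: int, q: int) -> float:
    """The closed-form approximation q * ln(t / i) used in the proof sketch."""
    return q * math.log(t / i)


if __name__ == "__main__":
    t, q = 10_000, 4
    for i in (10, 100, 1000):
        print(i, round(expected_uses(i, t, q), 3), round(log_estimate(i, t, q), 3))
    # For n = 2, any i below e^{-n} * t should be used more than n*q times
    # on expectation.
    n = 2
    i = 1000                      # below math.exp(-n) * t, about 1353
    print(expected_uses(i, t, q) > n * q)
```

The approximation tightens as i grows, since the gap between the sum and the logarithm is on the order of q/i.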














A consequence of Lemma 5 is that, unless the attacker 
has an enormous and linearly increasing amount of mem- 
ory at its disposal, it will not be able to mount a “per- 
sistent” attack against all D (or any significant fraction 
thereof). It will have to choose which bits of state must 
be kept, and which ones must be erased to make room 
for others: the attack will necessarily be “forgetful”. 


2.4 Parallelizable Construction 


In addition to allowing arbitrarily large t and forcing the
adversary to guess it, a complementary way to increase 
the adversary’s workload is to exploit any parallelism 
that is available to the user. Indeed, users care about 
the real elapsed time for processing a single password, 
whereas attackers worry about the total CPU time needed 
to cycle through the entire dictionary. Hence, we can 
hurt the adversary by increasing the CPU-time/elapsed- 
time ratio, with parallelizable key derivation. 

Interestingly, we note that this runs contrary to con- 
ventional wisdom on password hashing, which tradi- 
tionally abhors parallelism. The reason why our new 
password-level parallelism is safe is that only the legit- 
imate user can benefit from it. The adversary is always 
better off using the cruder kind of dictionary-level paral- 
lelism that has always been available to him. 

We require a cryptographic hash function
$H : \{0,1\}^{*} \to \{0,1\}^{\ell}$ where ℓ is a security parameter. Let
$\{\mathrm{STATEMENT}(l)\}_{l=1,\ldots,p}$ denote the p independent
statements STATEMENT(1), ..., STATEMENT(p), where p is a
“maximum parallelism” parameter. Our generic
parallelizable HKDF is as follows:


pHKDF_H.Prepare(w, r, t)

Inputs: password w, random string r, iteration count t
(may be implicit from user interrupt).
Output: verification string v (and corresponding key k).

1. {z_l ← H(w, r, l)} for l = 1, ..., p        // init each z_l independently
2. z ← H(z_1, ..., z_p)                        // init z from all the z_l
3. FOR i := 1, ..., t or until interrupted
4.     y_i ← z                                  // store z in array element y_i
5.     REPEAT q times
6.         {j_l ← 1 + (z_l mod i)} for l = 1, ..., p   // map each z_l to some j_l ∈ {1, ..., i}
7.         {z_l ← H(z_l, y_{j_l})} for l = 1, ..., p   // update each z_l independently
8.     z ← H(z_1, ..., z_p)                     // update z
9. v ← (H(y_1, z), r)
10. k ← H(z, r)


pHKDF_H.Extract(w, v)

Inputs: password w, verification string v.
Output: derived key k, or may never halt.

0. parse v as (h, r)
1. {z_l ← H(w, r, l)} for l = 1, ..., p        // p-way parallelizable
2. z ← H(z_1, ..., z_p)
3. FOR i := 1, ..., ∞
4.     y_i ← z
5.     REPEAT q times                           // p-way parallelizable across whole loop
6.         {j_l ← 1 + (z_l mod i)} for l = 1, ..., p   // p-way parallelizable
7.         {z_l ← H(z_l, y_{j_l})} for l = 1, ..., p   // p-way parallelizable
8.     z ← H(z_1, ..., z_p)
9.     IF H(y_1, z) = h THEN BREAK
10. k ← H(z, r)


The constant p determines the maximum parallelizabil- 
ity of the scheme: it can then vary from 1-fold to p-fold 
without significant overhead. Total computational cost 
is O(pqt) hash evaluations. Total memory requirement 
is O(p + t) hash values, including a constant ℓp bits of
memory overhead compared to the basic construction. 
Complexity-wise, the parameter p acts as a multiplier on 
the space/time proportionality ratio g, so that all secu- 
rity properties are retained with pq instead of q. It is thus 


easy to enable parallelism by increasing p and decreasing 
q proportionately. 

The relative penalty exerted on the adversary will be
proportional to the number N of CPUs that the user can
bring to bear, under the constraint that N ≤ p (and where
ideally, N | p).
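A sequential emulation of the parallelizable construction may help to see where the p independent lanes sit. As before, this is a sketch under assumptions: SHA-256 with length-prefixed inputs plays the role of H, lane indices are encoded as 4-byte integers, the combination of lanes into z is taken once per outer iteration, and `max_iters` is a hypothetical substitute for a manual reset. A real deployment would dispatch the lanes to separate cores.

```python
import hashlib


def H(*parts: bytes) -> bytes:
    """SHA-256 over length-prefixed parts (encoding is an assumption)."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


def _init_lanes(w: bytes, r: bytes, p: int) -> list:
    """Step 1: z_l <- H(w, r, l) for each lane l = 1..p."""
    return [H(w, r, l.to_bytes(4, "big")) for l in range(1, p + 1)]


def _round(lanes: list, y: list, i: int, q: int):
    """Steps 5-8: advance every lane q times, then combine the lanes into z."""
    for _ in range(q):
        # Lanes are mutually independent here, so this could run p-way parallel.
        lanes = [H(zl, y[int.from_bytes(zl, "big") % i]) for zl in lanes]
    return lanes, H(*lanes)


def p_prepare(w: bytes, r: bytes, t: int, p: int = 4, q: int = 2):
    if t < 1:
        raise ValueError("t must be at least 1")
    lanes = _init_lanes(w, r, p)
    z, y = H(*lanes), []
    for i in range(1, t + 1):
        y.append(z)
        lanes, z = _round(lanes, y, i, q)
    return (H(y[0], z), r), H(z, r)


def p_extract(w: bytes, v, p: int = 4, q: int = 2, max_iters=None):
    h, r = v
    lanes = _init_lanes(w, r, p)
    z, y, i = H(*lanes), [], 0
    while max_iters is None or i < max_iters:
        i += 1
        y.append(z)
        lanes, z = _round(lanes, y, i, q)
        if H(y[0], z) == h:
            return H(z, r)
    return None


if __name__ == "__main__":
    v, k = p_prepare(b"passphrase", bytes(32), t=40)
    print(p_extract(b"passphrase", v) == k)
```

Each outer iteration costs about pq hash evaluations but only q sequential steps when p cores are available, which is the CPU-time to elapsed-time ratio the section aims to raise.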


Partitioned Memory. The sequential scheme of
Section 2.2 can also be made p-wise parallelizable for
$p = 2^{l}$, by dropping l bits from r when Prepare-ing the
public string v = (h, r). To re-derive the key, the user tries
all completions of r by running p instances of Extract at
once until one halts. With p machines, the elapsed time
is unchanged; however the total work is O(pqt). The
drawback is that this requires O(pt) memory instead of
O(p + t) for the method of Section 2.4, but the advantage
is that processing and memory can be partitioned
over p independent machines. Applying the same trick
to the Section 2.4 scheme gives us a hybrid with two
parallelism options.


2.5 Practical Parameters 


HKDF parameter selection is non-critical and much easier
than with regular KDFs, since we are not trying to
make decisions for the user, or prevent obsolescence by
betting for or against Moore’s law. The only choices we
need to make concern the coefficients p and q. The rule
of thumb is: maximize pq in view of today’s machines,
and then fix p to cover all foreseeable needs for
parallelism.

For the sake of illustration, let ℓ = 256, and suppose
that the user’s key derivation hardware can compute
$n = 2^{25}$ hashes per second (e.g., with $2^{3}$ cores each
capable of $2^{22}$ hashes per second), and suppose the device
has $m = 2^{21} \cdot 256$ bits = 64 MiB of shared memory.
Memory capacity will be reached after $T = m\,pq/(\ell n)$
seconds of elapsed computation time. Thus, if we aim
for $pq = 2^{20}$, the maximum selectable processing time
on the device will be $2^{16}$ seconds (close to 1 day), in
increments of $2^{-5}$ second. We can take
$p = 2^{10} \cdot 3^{2} \cdot 5^{2} = 230\,400$ and hence q = 4 to get
$pq \approx 2^{20}$. Last, we ascertain that, per all these choices,
the available memory is still much larger than the
$\ell p \approx 7$ MiB of overhead that is the price to pay for the
parallelization option.

Suppose then that the user settles for $t = 2^{5}$ iterations
(to take 1 second on the current device), and chooses a
weak password with only 40 bits of entropy (from an
implicit dictionary of size $d = 2^{40}$). In these conditions, an
adversary will need $\ell t d = 2^{53}$ bits = 1024 TiB of memory
in order to conduct a persistent attack. On a faster
and/or more highly parallelized device, the user would
choose a correspondingly larger value of t, further
increasing the load on the adversary.
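The arithmetic of this example can be replayed with exact integers. The sketch assumes the figures read as ℓ = 256, n = 2^25 hashes per second, m = 64 MiB, p = 2^10 · 3^2 · 5^2, q = 4, t = 2^5 and d = 2^40; these are the only values consistent with the stated results (about 7 MiB of overhead, 1024 TiB for the attacker, a maximum close to one day). The function name `parameter_budget` is hypothetical.

```python
def parameter_budget():
    """Recompute the worked example; all inputs are assumed readings."""
    ell = 256                      # hash output size in bits
    n = 2 ** 25                    # hashes per second on the user's device
    m_bits = 2 ** 21 * 256         # shared memory in bits (64 MiB)
    p = 2 ** 10 * 3 ** 2 * 5 ** 2  # parallelism parameter
    q = 4                          # time/space ratio
    t = 2 ** 5                     # iterations chosen by the user
    d = 2 ** 40                    # dictionary size for a 40-bit password

    return {
        "p": p,
        "pq": p * q,
        "memory_mib": m_bits // 8 // 2 ** 20,
        # Seconds until memory fills, for the target pq = 2^20 and the actual pq.
        "T_target_s": m_bits * 2 ** 20 // (ell * n),
        "T_actual_s": m_bits * p * q // (ell * n),
        # Lane state of ell * p bits, expressed in MiB.
        "overhead_mib": ell * p / 8 / 2 ** 20,
        # State the attacker must hold to keep every word of D alive.
        "attack_bits_log2": (ell * t * d).bit_length() - 1,
        "attack_tib": ell * t * d // 8 // 2 ** 40,
    }


if __name__ == "__main__":
    for name, value in parameter_budget().items():
        print(f"{name:18s} {value}")
```

With the actual product pq = 921,600 rather than the rounded target 2^20, memory is exhausted slightly earlier than the rounded figure, which does not affect the conclusion.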
Flexible Parallelism. It is advisable to set p as a large
product of small factors, to facilitate the even distribution
of workload among any number N of CPUs such
that N divides p; this is easy to achieve in practice since
the values of pq tend to be quite large, on the order of
pq = 1000000. A nice consequence is that the same
HKDF can be dimensioned to accommodate any reasonably
foreseen amount of user-side parallelism (hence the
choice $p = 2^{10} \cdot 3^{2} \cdot 5^{2} = 230\,400$), and still be usable on
today’s sequential computers (with at least $\ell p \approx 7$ MiB
of memory in this example).


3 The Security Gap


We show that any adversary lacking enormous amounts 
of memory will incur a ~ 4x larger cost for not know- 
ing the iteration count. Since the penalty only strikes on 
wrong guesses, the user who knows the correct password 
will be immune to it. We say that HKDFs widen the
“security gap”.


3.1 Offline Dictionary Attack Model 


We consider the simplest and most general offline attack 
by an adversary A against a challenger C. We capture 
the password “guessability” by supposing that it is drawn 
uniformly at random from a known dictionary D, and 
define its entropy as the value $\log_2(\#D)$. The game is
as follows: 


Challenge. The challenger C picks $w \in_{\$} D$
and $r \in_{\$} \{0,1\}^{\ell}$ at random, chooses
$t \in \mathbb{N}$, and computes
(v, k) ← HKDF.Prepare(w, r, t). It gives the
string v to A.


Attack. The adversary A outputs as
many keys as it pleases, sequentially:
$k_1, k_2, \ldots$. It wins the game as soon as
some $k_i$ matches k = HKDF.Extract(w, v).


We assume that A can only retain state for a dwindling 
fraction of D, of size o(1) in t. 


Password (Min-)Entropy. In reality, passwords are
not sampled uniformly from a fixed D, but rather
non-uniformly from a set with no clear boundaries. The
worst-case unpredictability of a password chosen in this
manner is the minimum entropy, or min-entropy, defined
as $-\log_2(\max_w \Pr(w))$. The uniform password model
$w \in_{\$} D$ conveniently and accurately reflects the
difficulty of guessing from C’s true password distribution,
provided that $\log_2(\#D)$ matches the min-entropy of the
latter.
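As a small illustration of the definition, the sketch below computes the min-entropy of a finite password distribution and contrasts a uniform dictionary with a skewed one. The function name and both example distributions are ours and purely illustrative.

```python
import math


def min_entropy_bits(probs):
    """Return -log2(max_w Pr(w)) for a finite probability distribution."""
    if not probs or abs(sum(probs) - 1.0) > 1e-9 or min(probs) < 0:
        raise ValueError("probs must be a non-empty probability distribution")
    return -math.log2(max(probs))


if __name__ == "__main__":
    uniform = [1 / 1024] * 1024
    # Same support size, but one password is chosen half of the time.
    skewed = [0.5] + [0.5 / 1023] * 1023
    print(min_entropy_bits(uniform))
    print(min_entropy_bits(skewed))
```

Both distributions range over 1024 passwords, yet the skewed one is only as hard to guess as a uniform choice between two, which is why the uniform model must be sized by min-entropy rather than by support size.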


3.2 Finding the Optimal Attack Strategy 


By Lemma 5 we know that A cannot do better than 
outputting random keys until it “tries out” the correct 
password w for t iterations (using the Extract function).
Since A lacks the memory to maintain concurrent in- 
stances of Extract for any substantial subset of D, the 
only option is to “dovetail” the search, i.e.: 


— try all the words of D one by one (or few by few) 
for a bounded stretch of time; 

— retry the same for longer and longer time stretches, 
until t is eventually exceeded.


We can neglect the o(1) fraction of D on which A could 
run a persistent attack. Also, for uniform $w \in_{\$} D$ and
unknown t it is easy to show that it is optimal to spend
the same amount of effort on each candidate password. 
We deduce that the optimal algorithm for any forgetful 
attacker A is: 


Optimal-MemoryBound-A_D(v)

Input: verification string v.
Output: password w and key k.

1. FOR t̂ := t_1, t_2, ...                  // t_1 < t_2 < ... the search schedule
2.     FOR w ∈ D                            // in sequence or partially parallel
3.         RUN k ← Extract(w, v) for t̂ steps
4.         IF k ∈ {0,1}^ℓ THEN              // did Extract halt spontaneously?
5.             RETURN (w, k)


The only parameters to be specified are the increasing
sequence of iteration counts $t_1 < t_2 < \ldots$; the optimal
schedule $(t_1, t_2, \ldots)$ will depend on A’s uncertainty on t.


Effort and Penalty. We now quantify the total
computation effort expended by A as a function of the attack
schedule $(t_1, t_2, \ldots)$. Let us denote by $W_{t_1,t_2,\ldots}(t)$ the
total expected number of hash evaluations made by A if
the iteration count chosen by C is t. Let k be the smallest
index such that $t_k \ge t$. Let d = #D, and define the
constant u = dq. Since all of D will be explored for each
$t_i < t$, and only half of D on expectation for the first
$t_k \ge t$ (and nothing thereafter), we find that:


$$W_{t_1,t_2,\ldots}(t) = u \left( t_1 + \cdots + t_{k-1} + \frac{t_k}{2} \right) \quad\text{where}\quad k = \min\{i : t_i \ge t\}, \quad u = (\#D)\,q.$$

If, on the other hand, A had known the value of t and just
had to search for the password alone, the expected attack
effort, denoted $W_t(t)$, would have been:


$$W_t(t) = \frac{u\,t}{2} \quad\text{with}\quad u = (\#D)\,q. \tag{3}$$


We define the penalty (of not knowing t) as the ratio:


$$\pi(t) = \frac{W_{t_1,t_2,\ldots}(t)}{W_t(t)} = \frac{2\,(t_1 + \cdots + t_{k-1}) + t_k}{t} \ge 1.$$


Next, we show how to bound π(t).
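The penalty can be evaluated exactly for any concrete schedule. The sketch below assumes the effort formulas read $W_{t_1,t_2,\ldots}(t) = u\,(t_1 + \cdots + t_{k-1} + t_k/2)$ and $W_t(t) = u\,t/2$ with u = (#D) q, as the surrounding derivation indicates; the function names and the sample doubling schedule are illustrative.

```python
from fractions import Fraction


def expected_work(schedule, t, d, q):
    """W_{t1,t2,...}(t): full passes for every bound < t, half a pass at t_k."""
    u = d * q
    earlier = 0
    for bound in schedule:
        if bound >= t:                       # this is t_k, the first bound >= t
            return u * (earlier + Fraction(bound, 2))
        earlier += bound
    raise ValueError("schedule never reaches t")


def known_t_work(t, d, q):
    """W_t(t) = u * t / 2: half the dictionary, t iterations per word."""
    return Fraction(d * q * t, 2)


def penalty(schedule, t, d=2 ** 40, q=4):
    """pi(t) = W_{t1,t2,...}(t) / W_t(t); independent of d and q."""
    return expected_work(schedule, t, d, q) / known_t_work(t, d, q)


if __name__ == "__main__":
    doubling = [2 ** i for i in range(20)]
    for t in (5, 8, 9, 16, 1000):
        print(t, float(penalty(doubling, t)))
```

For a fixed doubling schedule the penalty swings between roughly 3 and 6 depending on where t falls relative to the bounds, which hints at why the optimal attacker must randomize its schedule.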





3.3. Bounding the Uncertainty Penalty 


First, we should clarify that the goal of A is to minimize
the value of π(t) on expectation over the random choices
made by A and C, and not necessarily in the extremal
cases where t is either very small or very large. Indeed,
it is not in the interest of C to choose too small a value
for t. Furthermore, A can easily achieve π(t) = 1 for the
maximal value of t (we assume that A knows what hardware
C uses), simply by setting $t_1 = t_{\max}$, but this would
be a Pyrrhic victory since the attack would be utterly
prohibitive and probably for naught. More generally, A
cannot simply let $t_1$ be the largest “likely” value for t, since
then C would figure it out and select $t = t_1 + 1$.

The foregoing strongly suggests that the game- 
theoretic optimum must be scale-invariant over the en- 
tire range {tio, ..., tri} > t that C considers useful. It also 
suggests that A and C should use mixed (i.e., random- 
ized) strategies. We use the notation [V] to denote the 
distribution of V. 


Lemma 6. Uniform equilibrium: There exists a constant
π_0, function of t_lo and t_hi, such that a Nash equilibrium
between A and C can only be reached for a randomized
attack strategy such that ∀t ∈ {t_lo, ..., t_hi} : π(t) = π_0.
The corresponding optimal strategy for A exists.


Informal proof sketch. Let [(t_1, t_2, ...)] be an optimal
mixture, or distribution of schedules, for A, and suppose
toward a contradiction that for this strategy the
expected penalty π(t) is not uniform over the entire
range of acceptable values for t. Thus, there exist
t_easy and t_hard in the interval {t_lo, ..., t_hi} such that ∀t :
π(t_easy) ≤ π(t) ≤ π(t_hard). Since the mixture is optimal,
C can compute its parameters and select t = t_hard to exert
the stiffest expected penalty on A. Predicting this, A
would let ρ = t_hard / t_easy and switch to a new mixture
given by: [(t'_1, t'_2, ...)] = [(ρ t_1, ρ t_2, ...)].

It is easy to see that π'(t) = π'(t_hard) under the new
mixture equals π(t_easy) < π(t) under the original one. It
follows that the new strategy performs better than the old
one when C consistently chooses t = t_hard (which was C’s
optimal defense in response to A’s supposedly optimal
attack). It follows that the original strategy was not optimal
after all, and we conclude that any optimal randomized
attack must incur the same penalty π_0 = π(t) for all
t ∈ {t_lo, ..., t_hi}, as claimed. Existence of the randomized
strategy characterized above follows from Nash.














Lemma 7. Scale invariance: In the limit (t_hi/t_lo) → ∞,
the optimal attack and defense strategies are scale-invariant.
For A the optimal ratio [(t_{i+1}/t_i)] converges
in distribution to a mixture [α] that is independent of i.
For C the optimal parameter [t] assumes a Zipf power
law, whose probability density function d/dx Pr(t < x) is
proportional to x^{-β} for some negative exponent −β in
the limit.


Informal proof sketch. Consider an optimal mixed strategy
[(t_1, t_2, ...)] and an iteration count t. Without loss
of generality, we assume that t_lo < t < t_hi. Fix
some δ > 0, and let t' = (1+δ) t. By Lemma 6,
we know that π(t) = π(t'). Now, consider the
mixed schedule [(t'_1, t'_2, ...)] obtained by substituting t'_i =
(1+δ) t_i for t_i everywhere, while keeping all probabilities
the same. Denote by π' the penalty function
under that new schedule. By definition, we have the
identity π'(t') = ((1+δ)/(1+δ)) π(t) = π(t), and by transitivity
we obtain that π(t') = π'(t'). We conclude that
π(t) = π'(t) for any distribution of t over the interval
{(1+δ) t_lo, ..., (1+δ)^{-1} t_hi}, for any δ > 0.

Since the strategy is optimal, it follows that multiplying
all the values in all the schedules it comprises by
any constant (1+δ) must preserve π(t); this also works
backward for (1+δ)^{-1}, and thus this is true in the limit
for any multiplier in R^+. In other words an optimal
strategy for A is invariant to (multiplicative) scaling. A
straightforward argument then shows that this must be
reciprocated by the optimal response employed by C.
Approximating t as a real in R^+, we deduce that t must obey
a Zipf power law, whose density is: d/dx Pr(t < x) ∝ x^{-β}
for some β ∈ R.

For the remaining claim, we first note that the
scale invariance implies that all the individual schedules
(t_1, t_2, ...) in the mixture must satisfy (t_{i+1}/t_i) =
(t_{j+1}/t_j) for all i, j, otherwise the multiplication by
a constant would result in a different mixture. We
have not yet ruled out the possibility of (sub-)mixtures
[(t_1, t_2, ...)], [(t'_1, t'_2, ...)], ... with unequal progressions
[(t_{i+1}/t_i)] ≠ [(t'_{i+1}/t'_i)], which is why so far we say
that [(t_{i+1}/t_i)] converges to a distribution [α] instead of
a value α*.
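The scaling identity used in the proof sketch (multiplying both the schedule and the iteration count by 1+δ leaves the penalty unchanged) is easy to check numerically. The sketch below does so for one arbitrary geometric schedule; the starting value 3.0, ratio 2.5, and δ = 0.37 are illustrative choices only.

```python
def penalty(schedule, t):
    """pi(t) = (2*(t_1 + ... + t_{k-1}) + t_k) / t, with k = min{i : t_i >= t}."""
    below = 0.0
    for t_i in schedule:
        if t_i >= t:
            return (2 * below + t_i) / t
        below += t_i
    raise ValueError("schedule never reaches t")

schedule = [3.0 * 2.5 ** i for i in range(12)]       # illustrative schedule
delta = 0.37                                          # illustrative scaling
scaled = [(1 + delta) * t_i for t_i in schedule]      # t'_i = (1 + delta) * t_i

# Largest deviation between pi'((1 + delta) * t) and pi(t) over a few values of t.
max_gap = max(
    abs(penalty(scaled, (1 + delta) * t) - penalty(schedule, t))
    for t in (10.0, 123.4, 5000.0)
)
print(max_gap)
```

The deviation is zero up to floating-point rounding, as expected from the definition of the penalty as a ratio of quantities that all scale linearly.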



















Randomized Starting Point. Lemmas 6 and 7 show
that the optimal attack schedule for A is a randomized
sequence (t_1, t_2, ...) where t_i = t_1 α^{i-1} for some random
starting point t_1 ≈ t_lo, and a progression coefficient
α drawn from [α]. For large enough t > t_1, the penalty becomes:

    \pi(t) = \frac{2 \sum_{i=1}^{k-1} t_i + t_k}{t} \approx \frac{\alpha+1}{\alpha-1} \cdot \frac{t_k}{t} = \frac{\alpha+1}{\alpha-1} \, y ,
    for some y = t_k / t ∈ [1, α).

Applying the scale invariance principle, we know that the
expected π(t) should be constant for varying t, which
requires that y be distributed with density ∝ y^{-1}:

    \frac{d}{dx} \Pr(y < x) = \begin{cases} x^{-1} / \ln\alpha & \text{for } 1 \le x < \alpha \\ 0 & \text{otherwise} \end{cases}    (1)

Since y has expectation \int_1^\alpha x \cdot \frac{x^{-1}}{\ln\alpha} \, dx = \frac{\alpha-1}{\ln\alpha}, the
uniform penalty for all choices of t is thus:

    \pi(t) \approx \pi_0 = \frac{\alpha+1}{\ln(\alpha)} .    (2)
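Equation (2) can be checked numerically by averaging the exact penalty of a geometric schedule over a log-uniform starting point. The sketch below assumes the starting point is t_1 = t_lo · α^s with s uniform in [0, 1), which is one way to make y follow the density of Equation (1); the value α = 3, the values of t, and the midpoint-rule averaging are illustrative choices.

```python
import math

def penalty(t_1, alpha, t):
    """Exact pi(t) for the geometric schedule t_i = t_1 * alpha**(i-1)."""
    below, t_i = 0.0, t_1
    while t_i < t:
        below += t_i
        t_i *= alpha
    return (2 * below + t_i) / t

def mean_penalty(alpha, t, t_lo=1.0, n=5000):
    """Average of pi(t) over t_1 = t_lo * alpha**s, s uniform in [0, 1)."""
    total = 0.0
    for j in range(n):
        s = (j + 0.5) / n                      # midpoint rule over s
        total += penalty(t_lo * alpha ** s, alpha, t)
    return total / n

alpha = 3.0                                     # illustrative coefficient
pi_0 = (alpha + 1) / math.log(alpha)            # Equation (2)
averages = [mean_penalty(alpha, t) for t in (1e4, 3.3e5, 7.7e6)]
print(pi_0, averages)
```

For t much larger than t_lo, the averages agree with (α+1)/ln(α) regardless of t, which is the uniformity required by Lemma 6.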
In(q@) 


Optimal Progression Coefficient. The last thing we
need is to compute π_0 as a function of the progression
coefficient α, which is drawn from some distribution [α]
yet to be specified. Notice from Equation (2) that π_0 is
a convex function of α that reaches a minimum for some
α = α* ∈ (1, ∞), hence the optimal [α] is the pointwise
distribution centered on α*. Asymptotically, the numerical
values of the optimal attack coefficient α* and the
corresponding minimal penalty π_0* are given by:

    \alpha^* = \arg\min_{\alpha > 1} \frac{\alpha+1}{\ln(\alpha)} , \qquad
    \pi_0^* = \frac{\alpha^*+1}{\ln(\alpha^*)} = \alpha^* \approx 3.59112147666862 .    (3)
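The constant in Equation (3) can be reproduced with a few lines of code. Setting the derivative of (α+1)/ln(α) to zero gives the condition α ln(α) = α + 1, which also explains why the minimal penalty equals α* itself. The sketch below solves that condition by bisection; the bracketing interval [2, 6] is an illustrative choice.

```python
import math

def penalty_constant(alpha):
    """pi_0 as a function of alpha, as in Equation (2)."""
    return (alpha + 1) / math.log(alpha)

def optimal_alpha(lo=2.0, hi=6.0, iters=200):
    """Bisection on g(alpha) = alpha*ln(alpha) - alpha - 1 (increasing for alpha > 1)."""
    def g(a):
        return a * math.log(a) - a - 1
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha_star = optimal_alpha()
pi_star = penalty_constant(alpha_star)
print(alpha_star, pi_star)
```

Both printed values agree with 3.59112147666862 to the precision shown in the text.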


To implement the optimal strategy, a rational attacker
A would fix α = α* from Equation (3), and start the
search schedule from some random t_1 near t_lo, set by a
factor y distributed as in Equation (1). No matter how cleverly
C chooses t, the expected penalty incurred by A is
π(t) = π_0* ≈ 3.5911215. (We mention that the same
constant, 3.59112..., arises in the context of the cow-path
problem [23], which is a hidden search problem with a
related structure and also with scale-invariance properties.)

Finally, and reciprocally, we can determine the optimal
Zipf-Pareto exponent −β* that a rational C should
choose to oppose A. Straightforward calculations show
that −β* = −1.







3.4 Justifying the Zipf-Pareto Hypothesis 


We have shown that (for the stated objective of maximiz- 
ing the expected security gap) the optimal distribution 
[t] is a power law of exponent −β* = −1 over some
fixed and fairly wide interval of interest. The question is 
whether this is a reasonable assumption to make for the 
behavior of C: 


Would a typical user not always program the 
same key derivation delay? 

The first answer to this question depends on the user’s 
psychology, and his or her understanding of the bene- 
fits provided by HKDFs. In fact, it is sufficient that the 
attacker believe that the user has a good reason to use 
very long delays on occasion (e.g., to protect a particu- 
larly sensitive ciphertext, or to shield a long-term backup 
password that will only be used as a last resort, as we 
already discussed in the introduction). Of course, if the 
attacker does not believe such a thing, but the user does 
it anyway, it is the attacker who will be sorry. 

The second answer is a phenomenological one. Zipf 
or Pareto distributions (of law ∝ x^{-β}) have been noted
to occur ubiquitously in the upper tail of empirical distri- 
butions in a variety of contexts, ranging from physics and 
geology with the distribution of reserves in oil field de- 
posits, to linguistics with the relative frequency of words 
in written texts, to economics regarding distribution of 
income and normalized returns of securities, and even to 
anthropology with the size of human population centers 
(see [34] for a list of these phenomena). Hence, it seems 
natural to assume that for large ensembles of users and/or 
ciphertexts, the induced iteration count t would be akin to
a Zipf process. This hypothesis draws credence from an 
observed pattern in natural and human sciences [31, 25] 
that the most common empirical distributions are Zipf- 
Pareto of exponent −β* = −(1+ε) ≲ −1.

In summary, we have shown that the HKDF approach 
gives us a small amount of “free” security: 


Theorem 8. Security gain: Under the reasonable hypothesis
that users do not always choose t predictably,
HKDFs increase the “effective entropy” of any password,
over regular KDFs, by:

    \log_2(\pi_0^*) \approx 1.84443445579378 bits .
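The figure in Theorem 8 is simply the base-2 logarithm of the penalty constant from Equation (3): an attacker who must spend about 3.59 times more work per password faces the same cost as if the password had that many more bits of entropy. A minimal check, where the 40-bit password is an illustrative example only:

```python
import math

PI_STAR = 3.59112147666862            # minimal penalty, from Equation (3)

gain_bits = math.log2(PI_STAR)        # extra "effective entropy" in bits
effective_bits = 40 + gain_bits       # illustrative: a 40-bit password
print(gain_bits, effective_bits)
```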


4 Real-world Implementation 


We believe that the case is strong for dropping KDFs in 
favor of HKDFs wherever possible, and to make it even 
stronger we discuss two compelling real-world applica- 
tions. 

We present two implementations of HKDFs on
GNU/Linux systems, which we intend to release as
open-source portable (POSIX) C code. Our first prototype is
a stand-alone command-line tool to be used in conjunction
with programs such as GNU gpg [15] or Ruusu’s
aespipe [1] to assemble strong password-based
encryption pipelines. Our second prototype is a patch for
the truecrypt [40] “plausibly deniable” disk encryption
software, which dramatically increases its resistance
to offline dictionary attacks, and thus plausible deniability
by implication.
We will see that HKDFs are much more secure in
practice than the KDFs they replace, at the cost of little 
tweaks to the UI, minimal impact on the user behavior, 
and no change to the hardware. 


4.1 TrueCrypt Disk Encryption


TrueCrypt [40] is a password-based disk encryption
software of modern design, developed for Windows
and subsequently ported to Linux, and available under
a permissive open-source license. TrueCrypt is
aimed at local storage encryption underneath the filesystem.
It provides plausible deniability, meaning that a
truecrypt-encrypted disk should be indistinguishable
from a shredded disk to anyone who lacks the password.
Free-space ciphertext and plaintext are designed
to look random: this allows a nested volume to be hidden
in a container volume’s free space. This is perhaps
the central feature of the design, and truecrypt is able
to avoid clobbering the hidden volume when writing on
the container, as long as both volumes are mounted.

The cryptographic design is otherwise fairly standard.
A password-based KDF-encrypted header holds a randomly
generated key, needed for encrypting the bulk of
the data (i.e., the disk sectors). One peculiarity is that
the KDF iteration count cannot be recorded because the
encrypted volume including the header must appear random,
and so it is burned into the truecrypt binary.


Plausible Deniability. Had they been available, 
HKDFs would have been very helpful, indeed: 


1. There would be no need to record the iteration count 
anywhere, yet no reason to keep it fixed. 


2. Plausible deniability would be enhanced greatly, 
because it all hinges on the password and the effort 
needed to crack it, and we know that HKDFs make 
that much harder for at least three reasons (arbitrar- 
ily large counts, widened security gap, and user-side 
parallelism). 


Implementation and Interface. TrueCrypt’s KDF 
is PKCS#5 with a user-selected hash (SHA1,
RIPEMD160, or WHIRLPOOL) and a hard-coded
iteration count (2000, 2000, and 1000 respectively). 
Since the hash selection cannot be recorded in the 
volume any more than the iteration count, truecrypt 
simply tries the three functions in sequence until one 
works. 

We implemented the generic HKDF of Section 2.2,
instantiated with SHA1, as a fourth option to be tried last
(quite naturally, since it must be allowed to run for an
arbitrary amount of time). The encrypted volume format
has space for 64 bytes of PKCS#5 salt; we reclaim 40
bytes for the HKDF public string v (which is random, see
Section 2), and pad the rest with random data.
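The reuse of the salt field can be pictured with a short sketch. The sizes (64 and 40 bytes) come from the text; the placement of v at the start of the field, and the helper names `pack_salt_field` and `unpack_v`, are assumptions made only for illustration, since the text does not specify the layout.

```python
import os

SALT_FIELD = 64   # bytes of PKCS#5 salt in the volume header (from the text)
V_LEN = 40        # bytes reclaimed for the public string v (from the text)

def pack_salt_field(v: bytes) -> bytes:
    """Place v in the salt field and pad the rest with random bytes.

    Assumption: v occupies the first 40 bytes.
    """
    if len(v) != V_LEN:
        raise ValueError("public string v must be exactly 40 bytes")
    return v + os.urandom(SALT_FIELD - V_LEN)

def unpack_v(field: bytes) -> bytes:
    """Recover v from a 64-byte salt field laid out as above."""
    if len(field) != SALT_FIELD:
        raise ValueError("salt field must be exactly 64 bytes")
    return field[:V_LEN]

v_example = os.urandom(V_LEN)          # v is itself random-looking
field = pack_salt_field(v_example)
```

Because both v and the padding are random, the field remains indistinguishable from an ordinary salt, which is what the deniability argument needs.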
The two functions we interface with TrueCrypt are: 


ulong HKDF_prepare (            //returns: actual value of t
  ulong tmax,                   //input: maximum t, 0 for none
  uint w_sz, uchar const *w,    //input: password w
  uint r_sz, uchar const *r,    //input: randomness r
  uint v_sz, uchar *v,          //output: public string v
  uint k_sz, uchar *k);         //output: derived key k

ulong HKDF_derive (             //returns: actual value of t
  ulong tmax,                   //input: maximum t, 0 for none
  uint w_sz, uchar const *w,    //input: password w
  uint v_sz, uchar const *v,    //input: public string v
  uint k_sz, uchar *k);         //output: derived key k


We build a modified version of TrueCrypt, called 
hkdf-tc, that invokes HKDF_prepare() when 
asked to create a new volume with the HKDF option 
turned on, and defers to HKDF_derive() when asked 
to mount a volume with an undecipherable header. Al- 
though both functions take a parameter tmax that could 
play the role of t, the actual selection of t is implicit and
interactive: 


When creating a new volume, hkdf-tc asks the user 
to enter the same password twice, and to choose a 
number of options. If the HKDF option is selected, 
HKDF_prepare() will invite the user to press 
a key after any—short or long—delay, explaining 
that the same delay will be incurred every time the 
volume is mounted as a defense against password 
guessers. 


When mounting an existing volume, hkdf-tc queries
the password and tries the built-in KDFs. If these
fail, HKDF_derive() is invoked, and the user is instructed
to press a key if it is taking too long, for the
program cannot distinguish a wrong password from
one with a longer delay.


In both cases, computations proceed in the background, 
pending the user signal which is detected by polling a 
non-blocking I/O system call at every iteration of the 
main loop. At ~ 1-30 Hz, this solution is responsive but 
not wasteful, and fits well with TrueCrypt’s command- 
line user interface. 
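The polling pattern described above can be sketched as follows. This is not the hkdf-tc code: the hash chain is a plain SHA-256 stand-in, and `stop_requested` is a hypothetical callback representing the non-blocking check for a key press (on a POSIX terminal it could, for instance, wrap `select.select([sys.stdin], [], [], 0)`).

```python
import hashlib

def iterate_until_signal(seed: bytes, stop_requested, tmax: int = 0):
    """Hash repeatedly, polling stop_requested() once per iteration.

    Returns (t, state), where t is the number of iterations performed.
    tmax = 0 means no limit, mirroring the tmax parameter in the text.
    """
    state, t = seed, 0
    while True:
        state = hashlib.sha256(state).digest()   # one unit of work
        t += 1
        if stop_requested() or (tmax and t >= tmax):
            return t, state

# Simulated user who "presses the key" on the third poll.
polls = iter([False, False, True])
t_user, _ = iterate_until_signal(b"toy-seed", lambda: next(polls))

# No key press at all: the tmax limit ends the loop.
t_auto, _ = iterate_until_signal(b"toy-seed", lambda: False, tmax=5)
```

In a real implementation each iteration of the main loop would itself take a noticeable fraction of a second, so polling once per iteration costs essentially nothing.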

With a graphical UI, another approach would be to 
add a button to the password entry dialog, greyed out 
at first, and becoming clickable once the user has en- 
tered a password: its label when commissioned by 
HKDF_prepare() would be [finish]; or [cancel] 
when commissioned by HKDF_derive(). One could 
also add a busy indicator, progress bar, or iteration 
counter, to taste. 


4.2 Command-line HKDF Tool 


Our second implementation is a small command-line 
tool, called hkdf, whose usage is as follows: 
1. hkdf -p [-r|-s] [-t MAX]        prompts for a passphrase, and prints a public string
   hkdf -k [-r|-s] [-t MAX] FILE   same, but writes publ.str. to FILE and prints the key
2. hkdf [-d] [-t MAX]              reads publ.str., asks for a passphr., and prints a key

Arguments:
  -p|-k    PREPARE mode --- once running, press the * key to finish
  -d       EXTRACT mode (the default) --- press Control-C to cancel
  -r       reads randomness from stdin (instead of /dev/random)
  -s       reads passphrase from stdin (instead of user prompt)
  -t MAX   triggers auto-finish or auto-cancel at iteration MAX


Each of the following commands creates a random
public string v, saves it to the file public.v, and prints
the corresponding key k on standard output, based on
the user’s passphrase. The second command asks for the
passphrase twice (on behalf of hkdf -p and hkdf -d,
in unspecified order), and re-derives k on-the-fly to provide
end-to-end verification without committing any secret
to disk.

# hkdf -k public.v 

# hkdf -p | tee public.v | hkdf -d 

The user must press the * key at some time after entering
the passphrase(s) (or use the -t option) to set the key
derivation delay. To recover k from public.v at a later
time, we use:

# hkdf < public.v 

which prompts for the passphrase once. 


Encryption with AESpipe and GnuPG. We can
combine hkdf with aespipe [1] to assemble a (randomized)
password-based AES encryptor with HKDF
resistance to dictionary attacks. The plaintext is a file
plain.bin and the ciphertext will consist of two files
crypt.v and crypt.aes. To encrypt:

# aespipe -p 4 4<<<`hkdf -k crypt.v` \
      < plain.bin > crypt.aes

In a Bourne-style shell that supports here-strings (e.g.,
bash), the string 4<<<`...` causes the command between
the backquotes to be executed in a sub-shell, and its
output redirected to the parent’s unused file descriptor #4;
meanwhile, the parameter -p 4 instructs aespipe to fetch
its key from the same. To decrypt:

# aespipe -d -p 4 4<<<`hkdf < crypt.v` \
      < crypt.aes > decrypted.bin

This command works similarly. If the passphrase is
good, hkdf will feed the right key to aespipe; otherwise,
it will run forever until interrupted by Control-C.


The hkdf tool is even easier to interface with other
programs, e.g., gpg [15]:

# hkdf -k crypt.v | gpg --passphrase-fd 0 \
      -o crypt.gpg -c plain.bin
# hkdf < crypt.v | gpg --passphrase-fd 0 \
      -o decrypted.bin crypt.gpg

This is merely suggestive; more sophisticated scripts
could merge the ciphertext into a single file.


GnuPG Key-rings. Since the user passphrase is the 
Achilles’ heel of the system, an excellent use of the 
hkdf/gpg synergy is to replace gpg’s default key- 
ring encryption with something stronger. To quote the 
gpg (1) manual page: 


WARNINGS 

Use a *good* password for your user ac- 
count and a *good* passphrase to protect 
your secret key. This passphrase is the 
weakest part of the whole system. Pro- 
grams to do dictionary attacks on your se- 
cret keyring are very easy to write and so 
you should protect your "~/.gnupg/" directory
very well.


HKDFs are an excellent way to add protection with or 
without changing the passphrase. Our hkdf tool and a 
small script to bind it to gpg are all that is needed. 


4.3 Concrete Security Gains 


We now quantify the security gained by upgrading True- 
Crypt and GnuPG from KDF to HKDF. Our test plat- 
form is a 1.5 GHz single-core x86 laptop running Debian 
Linux. 


Baseline Measurements. First we clock the various 
built-in KDFs to establish the benchmark: cf. Table 1.
These timings were obtained by instrumenting the rel- 
evant sections of code, in order to suppress overheads 
and obtain an accurate indication of the amount of work 
needed for a brute-force attack. 
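As a sanity check on the baseline figures, the time per password should be roughly the fixed multiplier divided by the normalized speed. The sketch below redoes that arithmetic with the values reported in Table 1, assuming 1 MB = 10^6 bytes; the dictionary keys are just labels.

```python
# (normalized speed per second, fixed multiplier), copied from Table 1.
kdfs = {
    "truecrypt HMAC-SHA1":      (25200, 2000),       # hash evaluations
    "truecrypt HMAC-RIPEMD160": (20400, 2000),
    "truecrypt HMAC-WHIRLPOOL": (9700, 1000),
    "gpg MD5":                  (30.0e6, 65536),     # bytes
    "gpg SHA1":                 (28.0e6, 65536),
    "gpg SHA256":               (15.2e6, 65536),
    "gpg SHA512":               (9.9e6, 65536),
}

# Predicted time per password, in milliseconds.
ms = {name: 1000 * mult / speed for name, (speed, mult) in kdfs.items()}
for name, value in ms.items():
    print(f"{name:26s} {value:6.1f} ms")
```

The predictions land within a few percent of the measured column (for example about 79 ms for HMAC-SHA1 and about 2.3 ms for the gpg default), with the measured WHIRLPOOL figure slightly lower than the quotient.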


HKDF Performance. Next, we measure the perfor- 
mance of the HKDF implementation, and the rate at 
which the size of the state is increased: cf. Table 2. As
we would expect, the raw throughput is very close to but 
slightly less than a “pure” implementation of the corre- 
sponding hash function (e.g., compare the SHA instan- 
tiation with gpg above). The discrepancy is caused by 
the modular reduction in the inner loop of the HKDF al- 
gorithm. 
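The memory rates in Table 2 are consistent with one hash-width value being added to the state per tick of the time resolution. That reading is an inference from the reported numbers, not a statement from the text; the sketch below just redoes the multiplication.

```python
# (hash, width in bits, resolution in Hz at q = 57600, at q = 230400), from Table 2.
rows = [
    ("SHA1", 160, 11.0, 2.8),
    ("WHIRLPOOL", 512, 2.7, 0.7),
]

memory_rate = {}
for name, width_bits, hz_small_q, hz_large_q in rows:
    width_bytes = width_bits // 8
    # Assumed model: bytes per second = ticks per second * bytes per tick.
    memory_rate[name] = (hz_small_q * width_bytes, hz_large_q * width_bytes)
    print(name, memory_rate[name])
```

This gives about 220 and 56 B/s for SHA1 and about 173 and 45 B/s for WHIRLPOOL, matching the table after rounding.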
Attainable Security Gains. We now find the actual
key derivation complexity (time and space) for several 
user-programmed delays, and what this entails for an 
Table 1: Baseline measurements.

Software    Digest function    Normalized    Fixed         Time per password
                               speed         multiplier    (as measured)

truecrypt   HMAC-SHA1          25200 #/s     2000 #        79 ms
            HMAC-RIPEMD160     20400 #/s     2000 #        98 ms
            HMAC-WHIRLPOOL     9700 #/s      1000 #        101 ms

gpg         MD5                30.0 MB/s     65536 B       2.2 ms
            SHA1 (default)     28.0 MB/s     65536 B       2.3 ms
            SHA256             15.2 MB/s     65536 B       4.3 ms
            SHA512             9.9 MB/s      65536 B       6.6 ms





Table 2: HKDF performance.

Hash H       Hash width    HKDF          Time resolution and memory rate (@1 CPU)
for HKDF     (bits)        throughput    (q = 57600)            (q = 230400)

SHA1         160           25.1 MB/s     11.0 Hz   220 B/s      2.8 Hz   56 B/s
WHIRLPOOL    512           19.7 MB/s     2.7 Hz    173 B/s      0.7 Hz   45 B/s





Table 3: Attainable security gains.

Program            Hash H       Time & Memory (per password)            Security gain
                   for HKDF     Programmed          Adversarial         vs. built-in KDF

hkdf-tc            WHIRLPOOL    3 sec.    <1 kB     11 sec.   2 kB      10^2 × (≈ 7 bits)
(vs. truecrypt)                 4 min.    41 kB     14 min.   147 kB    10^4 × (≈ 13 bits)
                                45 min.   0.5 MB    3 hours   1.7 MB    10^5 × (≈ 17 bits)

hkdf/gpg           SHA1         1 sec.    <1 kB     4 sec.    1 kB      10^3 × (≈ 10 bits)
(vs. gpg)                       10 min.   131 kB    36 min.   469 kB    10^6 × (≈ 20 bits)
                                2 hours   1.6 MB    7 hours   5.5 MB    10^7 × (≈ 23 bits)





Table 4: Attack times.

                           GnuPG     TrueCrypt   HKDF, 1-core           HKDF, 32-core
Opponent        #CPUs                            1 s        10 m        1 s        1 h

40-bit secret
Individual      10^1       7.7 y     275 y       12.5 ky    7.0 My      401 ky     1.4 Gy
Corporation     10^4       67 h      101 d       13 y       7.5 ky      401 y      1.4 My
Huge botnet     10^7       242 s     2.4 h       (31 h)†    7.5 y       (41 d)†    1.4 ky
“The World”     10^10      242 ms    8.6 s       (110 s)†   (18 h)†     (59 m)†    (147 d)†

60-bit secret
Government      10^6       80 y      2.9 ky      131 ky     79 My       4.2 My     15 Gy
“The World”     10^10      70 h      105 d       13 y       7.9 ky      420 y      1.5 My

† The flagged figures relate to a persistent attack, feasible for these
parameters if the opponent has 1 GiB per CPU.
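The unflagged HKDF entries of Table 4 appear to follow a simple formula: the size of the secret space, times the penalty constant from Equation (3), times the user's delay and core count, divided by the attacker's CPU count. This formula is an inference from the reported numbers rather than something stated in the text, and the year length of 365.25 days is likewise an assumption.

```python
ALPHA_STAR = 3.59112147666862          # penalty constant, Equation (3)
YEAR = 365.25 * 24 * 3600              # seconds per year (assumed convention)

def attack_years(secret_bits, delay_s, user_cores, attacker_cpus):
    """Estimated exhaustive-search time in years under the assumed formula."""
    work = 2 ** secret_bits * ALPHA_STAR * delay_s * user_cores   # CPU-seconds
    return work / attacker_cpus / YEAR

individual_1s = attack_years(40, 1, 1, 10)            # table: 12.5 ky
corporation_1h = attack_years(40, 3600, 32, 10**4)    # table: 1.4 My
world_60bit_1s = attack_years(60, 1, 1, 10**10)       # table: 13 y
print(individual_1s, corporation_1h, world_60bit_1s)
```

The three results (about 12.5 thousand years, 1.44 million years, and 13.1 years) agree with the corresponding table cells to the precision given there.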
Abstract


Unsolicited bulk e-mail, or SPAM, is a means to an end. 
For virtually all such messages, the intent is to attract the 
recipient into entering a commercial transaction — typi- 
cally via a linked Web site. While the prodigious infras- 
tructure used to pump out billions of such solicitations is 
essential, the engine driving this process is ultimately the 
“point-of-sale” — the various money-making “scams” 
that extract value from Internet users. In the hopes of 
better understanding the business pressures exerted on 
spammers, this paper focuses squarely on the Internet in- 
frastructure used to host and support such scams. We 
describe an opportunistic measurement technique called 
spamscatter that mines emails in real-time, follows the 
embedded link structure, and automatically clusters the 
destination Web sites using image shingling to capture 
graphical similarity between rendered sites. We have 
implemented this approach on a large real-time spam 
feed (over 1M messages per week) and have identified 
and analyzed over 2,000 distinct scams on 7,000 distinct 
servers. 


1 Introduction 


Few Internet security issues have attained the universal 
public recognition or contempt of unsolicited bulk email 
— SPAM. In 2006, industry estimates suggest that such 
messages comprise over 80% of all Internet email with
a total volume up to 85 billion per day [15,17]. The scale 
of these numbers underscores the prodigious delivery in- 
frastructures developed by “spammers” and in turn mo- 
tivates the more than $1B spent annually on anti-spam 
technology. However, the engine that drives this arms 
race is not spam itself — which is simply a means to an 
end — but the various money-making “scams” (legal or 
illegal) that extract value from Internet users. 

In this paper, we focus on the Internet infrastructure 
used to host and support such scams. In particular, we 


analyze spam-advertised Web servers that offer merchan- 
dise and services (e.g., pharmaceuticals, luxury watches, 
mortgages) or use malicious means to defraud users (e.g., 
phishing, spyware, trojans). Unlike mail-relays or bots, 
scam infrastructure is directly implicated in the spam 
profit cycle and thus considerably rarer and more valu- 
able. For example, a given spam campaign may use 
thousands of mail relay agents to deliver its millions of 
messages, but only use a single server to handle requests 
from recipients who respond. Consequently, the avail- 
ability of scam infrastructure is critical to spam prof- 
itability — a single takedown of a scam server or a spam- 
mer redirect can curtail the earning potential of an entire 
spam campaign. 

The goal of this paper is to characterize scam infras- 
tructure and use this data to better understand the dy- 
namics and business pressures exerted on spammers. To 
identify scam infrastructure, we employ an opportunis- 
tic technique called spamscatter. The underlying prin- 
ciple is that each scam is, by necessity, identified in the 
link structure of associated spams. To this end, we have 
built a system that mines email, identifies URLs in real 
time and follows such links to their eventual destina- 
tion server (including any redirection mechanisms put in 
place). We further identify individual scams by cluster- 
ing scam servers whose rendered Web pages are graph- 
ically similar using a technique called image shingling. 
Finally, we actively probe the scam servers on an ongo- 
ing basis to characterize dynamic behaviors like avail- 
ability and lifetime. Using the spamscatter technique on 
a large real-time spam feed (roughly 150,000 per day) we 
have identified over 2,000 distinct scams hosted across 
more than 7,000 distinct servers. Further, we character- 
ize the availability of infrastructure implicated in these 
scams and the relationship with business-related factors 
such as scam “type”, location and blacklist inclusion. 

The remainder of this paper is structured as follows. 
Section 2 reviews related measurement studies similar in 
topic or technique. In Section 3 we outline the structure
and lifecycle of Internet scams, and describe in detail
one of the more extensive scams from our trace as
a concrete example. Section 4 describes our measure- 
ment methodology, including our probing system, image 
shingling algorithm, and spam feed. In Section 5, we 
analyze a wide range of characteristics of Internet scam 
infrastructure based upon the scams we identify in our 
spam feed. Finally, Section 6 summarizes our findings 
and concludes. 


2 Related work 


Spamscatter is an opportunistic network measurement 
technique [5], taking advantage of spurious traffic — 
in this case spam — to gain insight into “hidden” as- 
pects of the Internet — in this case scam hosting infras- 
tructure. As with other opportunistic measurement tech- 
niques, such as backscatter to measure Internet denial- 
of-service activity [20], network telescopes and Internet 
sinks [32] to measure Internet worm outbreaks [19, 21], 
and spam to measure spam relays [27], spamscatter pro- 
vides a mechanism for studying global Internet behavior 
from a single or small number of vantage points. 

We are certainly not the first to use spam for oppor- 
tunistic measurement. Perhaps the work most closely 
related to ours is Ramachandran and Feamster’s recent 
study using spam to characterize the network behavior of 
the spam relays that sent it [27]. Using extensive spam 
feeds, they categorized the network and geographic loca- 
tion, lifetime, platform, and network evasion techniques 
of spam relay infrastructure. They also evaluated the ef- 
fectiveness of using network-level properties of spam re- 
lays, such as IP blacklists and suspect BGP announce- 
ments, to filter spam. When appropriate in our analyses, 
we compare and contrast characteristics of spam relays 
and scam hosts; some scam hosts also serve as spam re- 
lays, for example. In general, however, due to the differ- 
ent requirements of the two underground services, they 
exhibit different characteristics; scam hosts, for exam- 
ple, have longer lifetimes and are more concentrated in 
the U.S. 

The Webb Spam Corpus effort harvests URLs from 
spam to create a repository of Web spam pages, Web 
pages created to influence Web search engine results or 
deceive users [31]. Although both their effort and our 
own harvest URLs from spam, the two projects differ 
in their use of the harvested URLs. The Webb Spam 
Corpus downloads and stores HTML content to create 
an offline data set for training classifiers of Web spam 
pages. Spamscatter probes sites and downloads content 
over time, renders browser screenshots to identify URLs 
referencing the same scam, and analyzes various charac- 
teristics of the infrastructure hosting scams. 

Both community and commercial services consume 


URLs extracted from spam. Various community services 
mine spam to specifically identify and track phishing 
sites, either by examining spam from their own feeds or 
collecting spam email and URLs submitted by the com- 
munity [1,6, 22,25]. Commercial Web security and fil- 
tering services, such as Websense and Brightcloud, track 
and analyze Web sites to categorize and filter content, 
and to identify phishing sites and sites hosting other po- 
tentially malicious content such as spyware and keylog- 
gers. Sites advertised in spam provide an important data 
source for such services. While we use similar data in our 
work, our goal is infrastructure characterization rather 
than operational filtering. 

Botnets can play a role in the scam host infrastructure, 
either by hosting the spam relays generating the spam 
we see or by hosting the scam servers. A number of 
recent efforts have developed techniques for measuring 
botnet structure, behavior, and prevalence. Cook et al. [9] 
tested the feasibility of using honeypots to capture bots, 
and proposed a combination of passive host and network 
monitoring to detect botnets. Bacher et al. [23] used hon- 
eynets to capture bots, infiltrate their command and con- 
trol channel, and monitor botnet activity. Rajab et al. [26] 
combined a number of measurement techniques, includ- 
ing malware collection, IRC command and control track- 
ing, and DNS cache probing. The last two approaches 
have provided substantial insight into botnet activity by 
tracking hundreds of botnets over periods of months. Ra- 
machandran and Feamster [27] provided strong evidence 
that botnets are commonly used as platforms for spam 
relays; our results suggest botnets are not as common for 
scam hosting. 

We developed an image shingling algorithm to determine
the equivalence of screenshots of rendered Web
pages. Previous efforts have developed techniques to de- 
termine the equivalence of transformed images as well. 
For instance, the SpoofGuard anti-phishing Web browser 
plugin compares images on Web pages with a database of 
corporate logos [7] to identify Web site spoofing. Spoof- 
Guard compares images using robust image hashing, an 
approach employing signal processing techniques to cre- 
ate a compressed representation of an image [30]. Robust 
image hashing works well against a number of different 
image transformations, such as cropping, scaling, and fil- 
tering. However, unlike image shingling, image hashing 
is not intended to compare images where substantial re- 
gions have completely different content; refinements to 
image hashing improve robustness (e.g., [18,28]), but do 
not fundamentally extend the original set of transforms. 


3 The life and times of an Internet scam


In this section we outline the structure and life cycle
of Internet scams, and describe in detail one of the
more extensive scams from our trace as a concrete example.
This particular scam advertises “Downloadable
Software,” such as office productivity tools (Microsoft,
Adobe, etc.) and popular games, although in general
the scams we observed were diverse in what they offered
(Section 5.1).

Figure 1: Components of a typical Internet scam.

Figure 1 depicts the life of a spam-driven Internet 
scam. First, a spam campaign launches a vast number 
of unsolicited spam messages to email addresses around 
the world; a large spam campaign can exceed 1 billion
emails [12]. In turn the content in these messages fre- 
quently advertises a scam — unsolicited merchandise 
and services available through the Web — by embed- 
ding URLs to scam Web servers in the spam; in our data, 
roughly 30% of spam contains such URLs (Section 5.1). 
An example of spam that does not contain links would 
be “pump-and-dump” stock spam intended to manipu- 
late penny stock prices [3]; the recent growth of image- 
based stock spam has substantially reduced the fraction 
of spam using embedded URLs, shrinking from 85% in 
2005 to 55% in 2006 [12]. These spam campaigns can 
be comparatively brief, with more than half lasting less 
than 12 hours in our data (Section 5.4). For our exam- 
ple software scam, over 5,000 spam emails were used to 
advertise it over a weeklong period. 

Knowing or unsuspecting users click on URLs in 
spam to access content from the Web servers hosting the 
scams. While sometimes the embedded URL directly 
specifies the scam server, more commonly it indicates 
an intermediate Web server that subsequently redirects 
traffic (using HTTP or JavaScript) toward the scam 
server. Redirection serves multiple purposes. When 
spammer and scammer are distinct, it provides a simple 
means for tagging requests with the spammer’s affiliate 
identifier (used by third-party merchants to compensate 
independent “advertisers”) and laundering the spam-based 
origin before the request reaches the merchant (this 
laundering provides plausible deniability for the merchant 
and protects the spammer from potential conflicts 
over the merchant’s advertising policy). If spammer and 
scammer are the same, a layer of redirection is still useful 
for avoiding URL-based blacklists and providing deployment 
flexibility for scam servers. In our traces, most 
scams use at least one level of redirection (Section 4). 

Figure 2: Screenshots, hostnames, and IP addresses of 
different hosts for the “Downloadable Software” scam. 
The highlighted regions show portions of the page that 
change on each access due to product rotation. Image 
shingling is resilient to such changes and identifies these 
screenshots as equivalent pages. 

On the back end, scams may use multiple servers to 
host scams, both in terms of multiple virtual hosts (e.g., 
different domain names served by the same Web server) 
and multiple physical hosts identified by IP address (Sec- 
tion 5.2). However, for the scams in our spam feed, the 
use of multiple virtual hosts is infrequent (16% of scams) 
and multiple physical hosts is rare (6%); our example 
software scam is one of the more extensive scams, using 
at least 99 virtual hosts on three physical hosts. 

Finally, different Web servers (physical or virtual), and 
even different accesses to a scam using the same URL, 
can result in slightly different downloaded content for the 
same scam. Intentional randomness for evasion, rotating 
advertisements, featured product rotation, etc., add an- 
other form of aliasing. Figure 2 shows example screen- 
shots among different hosts for the software scam. To 
overcome these aliasing issues, we use screenshots of 
Web pages as a basis for identifying all hosts participat- 
ing in a given scam (Section 4.2). 

A machine hosting one scam may be shared with other 
scams, as when scammers run multiple scams at once or 
the hosts are third-party infrastructure used by multiple 
scammers. Sharing is common, with 38% of scams being 
hosted on a machine with at least one other scam 
(Section 5.3). One of the machines hosting the soft- 
ware scam, for example, also hosted a pharmaceutical 
scam called “Toronto Pharmacy” (which happened to be 
hosted on a server in Guangzhou, China). 


The lifetimes of scams are much longer than spam 
campaigns, with 50% of scams active for at least a week 
(Section 5.4). Furthermore, scam hosts have high availability 
during their lifetime (most above 99%) and appear 
to have good network connectivity (Section 5.5); our 
software scam remained active for the entire measurement 
period and was available 97% of the time. Finally, 
scam hosts tend to be geographically concentrated in the 
United States; over 57% of scam hosts from our data 
mapped to the U.S. (Section 5.6.2). Such geographic 
concentration contrasts sharply with the location of spam 
relay hosts; for comparison, only 14% of spam relays 
used to send the spam to our feed are located in the U.S. 
Figure 3 shows the geographic locations of the spam re- 
lays and scam hosts for the software scam. The three 
scam hosts were located in China and Russia, whereas 
the 85 spam relays were located around the world in 30 
countries. 


The lifetimes, high availability, and good network con- 
nectivity, as well as the geographic diversity of spam 
relays compared with scam hosts, all reflect the funda- 
mentally different requirements and circumstances be- 
tween the two underground services. Spam relays re- 
quire no interaction with users, need only be available to 
send mail, but must be great enough in number to mit- 
igate the effects of per-host blacklists. Consequently, 
spam relays are well suited to “commodity” botnet in- 
frastructure [27]; one recent industry estimate suggests 
that over 80% of spam is in fact relayed by bots [13]. By 
contrast, scam hosts are naturally more centralized (due 
to hosting a payment infrastructure), require interactive 
response time to their target customers, and may — in 
fact — be hosting legal commerce. Thus, scam hosts are 
much more likely to have high-quality hosting infrastruc- 
ture that is stable over long periods. 


4 Methodology 


This section describes our measurement methodology. 
We first explain our data collection framework for prob- 
ing scam hosts and spam relays, and then detail our im- 
age shingling algorithm for identifying equivalent scams. 
Finally, we describe the spam feed we use as our data 
source and discuss the inherent limitations of using a sin- 
gle viewpoint. 


4.1 Data collection framework 


We built a data collection tool, called the spamscatter 
prober, that takes as input a feed of spam emails, ex- 
tracts the sender and URLs from the spam messages, and 
probes those hosts to collect various kinds of information 
(Figure 1). For spam senders, the prober performs a ping, 
traceroute, and DNS-based blacklist lookup (DNSBL) 
once upon receipt of each spam email. The prober per- 
forms more extensive operations for the scam hosts. As 
with spam senders, it first performs a ping, traceroute, 
and DNSBL lookup on scam hosts. In addition, it down- 
loads and stores the full HTML source of the Web page 
specified by valid URLs extracted from the spam (we do 
not attempt to de-obfuscate URLs). It also renders an 
image of the downloaded page in a canonical browser 
configuration using the KHTML layout engine [14], and 
stores a screenshot of the browser window. For scam 
hosts, the prober repeats these operations periodically for 
a fixed length of time. For the trace in this paper, we 
probed each host and captured a screenshot while vis- 
iting each URL every three hours. Starting from when 
the first spam email introduces a new URL into the data 
set, we probe the scam host serving that URL for a week 
independently of whether the probes fail or succeed. 

As we mentioned earlier, many spam URLs simply 
point to sites that forward the request onto another server. 
There are many possible reasons for the forwarding be- 
havior, such as tracking users, redirecting users through 
third-party affiliates or tracking systems, or consolidat- 
ing the many URLs used in spam (ostensibly to avoid 
spam filters) to just one. Occasionally, we also noticed 
forwarding that does not end, indicating a misconfiguration, 
a programming error, or a deliberate attempt to 
avoid spidering. 

The prober accommodates a variety of link forwarding 
practices. While some links direct the client immediately 
to the appropriate Web server, others execute a series of 
forwarding requests, including HTTP 302 server redi- 
rects and JavaScript-based redirects. To follow these, the 
prober processes received page content to extract simple 
META refresh tags and JavaScript redirect statements. 
It then tracks every intermediate page between the initial 
link and the final content page, and marks whether a page 
is the end of the line for each link. Properly handling for- 
warding is necessary for accurate scam monitoring. Over 
68% of scams used some kind of forwarding, with an av- 
erage of 1.2 forwards per URL. 
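
The forwarding logic described above can be illustrated with a minimal sketch. It is not the prober's actual code: the regular expressions, the function names, and the `fetch` callable (assumed to return page HTML after any HTTP 302 redirects have already been followed) are all hypothetical, and the patterns cover only the simplest META refresh and JavaScript redirect forms.

```python
import re
from urllib.parse import urljoin

# Hypothetical patterns covering only simple redirect forms.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
    r'content\s*=\s*["\']?\s*\d+\s*;\s*url\s*=\s*([^"\'>\s]+)',
    re.IGNORECASE,
)
JS_REDIRECT = re.compile(
    r'(?:window\.|document\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def next_hop(base_url, html):
    """Return the absolute URL a page forwards to, or None if it does not."""
    for pattern in (META_REFRESH, JS_REDIRECT):
        match = pattern.search(html)
        if match:
            return urljoin(base_url, match.group(1))
    return None


def follow_forwards(start_url, fetch, max_hops=10):
    """Track every intermediate page between the initial link and the end.

    `fetch` is an assumed callable mapping a URL to its HTML text.
    Returns (chain, reached_end); reached_end is False when the forwarding
    loops or exceeds max_hops (forwarding that "does not end").
    """
    chain = [start_url]
    seen = {start_url}
    for _ in range(max_hops):
        target = next_hop(chain[-1], fetch(chain[-1]))
        if target is None:
            return chain, True      # final content page reached
        if target in seen:
            return chain, False     # forwarding loop
        chain.append(target)
        seen.add(target)
    return chain, False             # gave up after max_hops
```

The number of forwards for a URL is then `len(chain) - 1`, which is the quantity averaged in the 1.2 forwards-per-URL figure.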


4.2 Image shingling 


Many of our analyses compare content downloaded from 
scam servers to determine if the scams are equivalent. 
For example, scam hosts may serve multiple independent 
scams simultaneously, and we cannot assume that 
URLs that lead to the same host are part of the same 
scam. Similarly, scams are hosted on multiple virtual 
servers as well as distributed across multiple machines. 
As a result, we need to be able to compare content from 
scam servers on different hosts to determine whether 
they are part of the same scam. Finally, even for content 
downloaded from the same URL over time, we need 
to determine whether the content fundamentally changes 
(e.g., the server has stopped hosting the scam but returns 
valid HTTP responses to requests, or it has transitioned 
to hosting a different scam altogether). 

Figure 3: Geographic locations of the spam relays and scam server hosts for the “Downloadable Software” scam. The 
three scam servers are located in China and Russia and shown with dark grey points. The 85 spam relays are located 
around the world in more than 30 different countries, and are shown in white. 


Various kinds of aliasing make determining scam 
equivalence across multiple hosts, as well as over time, a 
challenging problem. One possibility is to compare spam 
messages within a window of time to identify emails 
advertising the same scam. However, the randomness 
and churn that spammers introduce to defeat spam fil- 
ters makes it extremely difficult to use textual informa- 
tion in the spam message to identify spam messages for 
the same scam (e.g., spam filters continue to struggle 
with spam message equivalence). Another possibility is 
to compare the URLs themselves. Unfortunately, scam- 
mers have many incentives not to use the same URL 
across spams, and as a result each spam message for 
a scam might use a distinct URL for accessing a scam 
server. For instance, scammers may embed unique tracking 
identifiers in the query part of URLs, use URLs that 
contain domain names to different virtual servers, or sim- 
ply randomize URLs to defeat URL blacklisting. 


A third option is to compare the HTML content down- 
loaded from the URLs in the spam for equivalence. The 
problem of comparing Web pages is a fundamental oper- 
ation for any effort that identifies similar content across 
sites, and comparing textual Web content has been stud- 
ied extensively already. For instance, text shingling tech- 
niques were developed to efficiently measure the simi- 
larity of Web pages, and to scale page comparison to the 
entire Web [4, 29]. In principle, a similar method could 
be used to compare the HTML text between scam sites, 
but in practice the downloaded HTML frequently pro- 
vides insufficient textual information to reliably identify 
a scam. Indeed, many scams contained little textual con- 
tent at all, and instead used images entirely to display 
content on the Web page. Also, many scams used frames, 
iframes, and JavaScript to display content, making it dif- 
ficult to capture the full page context using a text-based 
Web crawler. 


Finally, a fourth option is to render screenshots of 
the content downloaded from scam sites, and to com- 
pare the screenshots for equivalence. Screenshots are 
an attractive basis for comparison because they sidestep 
the aforementioned problems with comparing HTML 
source. However, comparing screenshots is not without 
its own difficulties. Even for the same scam accessed by 
the same URL over time — much less across different 
scam servers — scam sites may intentionally introduce 
random perturbations of the page to prevent simple im- 
age comparison, display rotating advertisements in vari- 
ous parts of a page, or rotate images of featured products 
across accesses. Figure 2 presents an example of screen- 
shots from different sites for the same scam that show 
variation between images due to product rotation. 


Considering the options, we selected screenshots as 
the basis for determining scam equivalence. To overcome 
the problems described earlier, we developed 
an image-clustering algorithm, called image shingling, 
based on the notion of shingling from the text similarity 
literature. Text shingling decomposes a document 
into many segments, usually consisting of a small number 
of characters, and hashes each segment. Various techniques 
have been developed to increase the efficiency and reduce 
the space complexity of this process [11]. Next, these 
hashed “shingles” are sorted so that hashes for documents containing 
similar shingles are close together. The ordering allows 
all the documents that share an identical shingle to be 
found quickly. Finally, documents are clustered accord- 
ing to the percentage of shared shingles between them. 
The power of the algorithm is that it essentially performs 
O(N^2) comparisons in O(N lg N) time. 
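
As a rough illustration of the sort-then-group idea behind text shingling, the following sketch hashes fixed-length character shingles, sorts the (shingle, document) entries, and counts shared shingles only among documents that land in the same run of equal hashes. The shingle length `k`, the use of MD5, and the function names are assumptions made for the example, not details taken from the cited work.

```python
import hashlib
from collections import Counter
from itertools import combinations


def shingles(text, k=4):
    """Return the set of hashed k-character shingles of a document."""
    return {
        hashlib.md5(text[i:i + k].encode("utf-8")).hexdigest()
        for i in range(len(text) - k + 1)
    }


def shared_shingle_counts(docs, k=4):
    """Count shared shingles per document pair via one sorted pass.

    docs: {doc_id: text}. Returns Counter keyed by (doc_a, doc_b), a < b.
    """
    # One (shingle, doc_id) entry per distinct shingle in each document.
    entries = sorted(
        (s, doc_id) for doc_id, text in docs.items() for s in shingles(text, k)
    )
    shared = Counter()
    i = 0
    while i < len(entries):
        # Sorting places identical shingles next to each other; find the run.
        j = i
        while j < len(entries) and entries[j][0] == entries[i][0]:
            j += 1
        owners = [doc_id for _, doc_id in entries[i:j]]
        for a, b in combinations(owners, 2):
            shared[(a, b)] += 1
        i = j
    return shared
```

Only documents that actually share a shingle are ever paired, which is where the savings over a naive all-pairs comparison come from; a shingle shared by very many documents would still produce many pairs in this simplified version.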

Our image shingling algorithm applies a similar pro- 
cess to the image domain. The algorithm first divides 
each image into fixed size chunks in memory; in our ex- 
periments, we found that an image chunk size of 40x40 
pixels was an effective tradeoff between granularity and 
shingling performance. We then hash each chunk to cre- 
ate an image shingle, and store the shingle on a global list 
together with a link to the image (we use the MD4 hash 
to create shingles due to its relative speed compared with 
other hashing algorithms). After sorting the list of shingles, 
we create a hash table, indexed by shingle, to track 
the number of times two images shared a similar shingle. 
Scanning through the table, we create clusters of images 
by finding image pairs that share at least a threshold 
percentage of similar shingles. 
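
A minimal sketch of the image shingling steps follows, under several stated assumptions: images are given as row-major grayscale byte strings, MD5 stands in for MD4 (which is not reliably available in Python's hashlib), partial chunks at the right and bottom edges are ignored, identical chunks within one image collapse into one shingle, and the shared fraction is measured against the image with fewer shingles. The paper does not specify these details, so they are illustrative choices only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def image_shingles(pixels, width, height, chunk=40):
    """Hash each chunk x chunk tile of a row-major, 1-byte-per-pixel image."""
    result = set()
    for top in range(0, height - chunk + 1, chunk):
        for left in range(0, width - chunk + 1, chunk):
            h = hashlib.md5()  # stand-in for MD4 (assumption)
            for row in range(top, top + chunk):
                start = row * width + left
                h.update(pixels[start:start + chunk])
            result.add(h.digest())
    return result


def similar_pairs(images, threshold=0.7, chunk=40):
    """images: {name: (pixels, width, height)}.

    Returns sorted (a, b) pairs sharing at least `threshold` of their shingles.
    """
    shingle_sets = {
        name: image_shingles(*img, chunk=chunk) for name, img in images.items()
    }
    # Table indexed by shingle: which images contain it.
    owners = defaultdict(set)
    for name, sset in shingle_sets.items():
        for s in sset:
            owners[s].add(name)
    # Count how many shingles each image pair has in common.
    shared = defaultdict(int)
    for names in owners.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    pairs = []
    for (a, b), count in shared.items():
        smaller = min(len(shingle_sets[a]), len(shingle_sets[b]))
        if smaller and count / smaller >= threshold:
            pairs.append((a, b))
    return sorted(pairs)
```

Because chunks are hashed exactly, a change confined to one region of a page (such as a rotated product image) invalidates only the shingles covering that region and leaves the rest of the page matching.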

To determine an appropriate threshold value, we took 
one day’s worth of screenshots and ran the image shin- 
gling algorithm for all values of thresholds in increments 
of 1%. Figure 4 shows the number of clusters created 
per threshold value. The plateau in the figure starting 
at 70% corresponds to a fair balance between being too 
strict, which would reduce the possibility of clustering 
nearly similar pages, and being too lenient, which would 
cluster distinct scams together. Manually inspecting the 
clusters generated at this threshold plateau and the cluster 
membership changes that occur at neighboring threshold 
values, we found that a threshold of 70% minimized false 
negatives and false positives for determining scam page 
equivalence. 

Figure 4: The choice of a threshold value for image shingling 
determines the number of clusters. 
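
The threshold sweep can be sketched as follows. The example assumes that pairwise similarity scores (fractions between 0 and 1) are already available and that clusters are formed as connected components of the "similar enough" relation; the paper does not state how pairs are merged into clusters, so the union-find grouping here is an assumption.

```python
def count_clusters(names, similarity, threshold):
    """Number of clusters when pairs scoring >= threshold are merged.

    names: iterable of image names.
    similarity: {(a, b): score} for pairs that share any shingles.
    """
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)
    return len({find(n) for n in parent})


def sweep(names, similarity):
    """Cluster count for every threshold from 1% to 100% in 1% steps."""
    names = list(names)
    return {t: count_clusters(names, similarity, t / 100) for t in range(1, 101)}
```

A plateau then shows up as a run of consecutive thresholds in the `sweep` result that all map to the same cluster count.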

We have developed a highly optimized version of this 
basic algorithm that, in practice, completes an all-pairs 
comparison in roughly linear time. We find that image 
shingling is highly effective at clustering similar scam 
pages, while neatly side-stepping the adversarial obfuscations 
in spam messages, URLs, and page contents. 
Clearly, a determined scammer could introduce steps to 
reduce the effectiveness of image shingling as described 
(e.g., by slightly changing the colors of the background 
or embedded images on each access, changing the com- 
pression ratio of embedded images, etc.). However, we 
have not witnessed this behavior in our trace. If scam- 
mers do take such steps, this methodology will likely 
need to be refined. 


4.3 Spam feed and limitations 


The source of spam determines the scams we can mea- 
sure using this methodology. For this study, we have 
been able to take advantage of a substantial spam feed: 
all messages sent to any email address at a well-known 
four-letter top-level domain. This domain receives over 
150,000 spam messages every day. We can assume that 
any email sent to addresses in this domain is spam be- 
cause no active users use addresses on the mail server 
for the domain. Examining the “From” and “To” ad- 
dresses of spam from this feed, we found that spam- 
mers generated “To” email addresses using a variety of 
methods, including harvested addresses found in text on 
Web pages, universal typical addresses at sites, as well 
as name-based dictionary address lists. Over 93% of 
“From” addresses were used only once, suggesting the 
use of random source addresses to defeat address-based 
spam blacklists. 
Characteristic        | Summary Result 
Trace period          | 11/28/06 – 12/11/06 
Spam messages         | 1,087,711 
Spam w/ URLs          | 319,700 (30% of all spam) 
Unique URLs           | 36,390 (11% of all URLs) 
Unique IP addresses   | 7,029 (19% of unique URLs) 
Unique scams          | 2,334 (6% of unique URLs) 


Table 1: Summary of spamscatter trace. 


We analyze Internet scam hosting infrastructure using 
spam from only a single, albeit highly active, spam feed. 
As with other techniques that use a single network view- 
point to study global Internet behavior, undoubtedly this 
single viewpoint introduces bias [2,8]. For example, the 
domain that provides our spam feed has no actual users 
who read the email. Any email address harvesting pro- 
cess that evaluates the quality of email addresses, such 
as correlating spam email targets with accesses on scam 
sites, would be able to determine that sending spam to 
these addresses yields no returns (that is, until we began 
probing). 

While measuring the true bias of our data is impos- 
sible, we can anecdotally gauge the coverage of scams 
from our spam feed by comparing them with scams iden- 
tified from an entirely different spam source. As a com- 
parison source, we used the spam posted to the Usenet 
group news.admin.net-abuse.sightings, a forum for ad- 
ministrators to contribute spam [22]. Over a single 3-day 
period, January 26–28, 2007, we collected spam from 
both sources. We captured 6,977 spam emails from the 
newsgroup and 113,216 spam emails from our feed. The 
newsgroup relies on user contributions and is moderated, 
and hence is a reliable source of spam. However, it is 
also a much smaller source of spam than our feed. 

Next we used image shingling to distill the spam from 
both sources into distinct scams, 205 from the newsgroup 
and 1,687 from our feed. Comparing the scams, we 
found 25 that were in both sets, i.e., 12% of the news- 
group scams were captured in our feed as well. Of the 
30 most-prominent scams identified from both feeds (in 
terms of the number of virtual hosts and IP addresses), 
ten come from the newsgroup feed. These same ten, fur- 
thermore, were also in our feed. Our goal was not to 
achieve global coverage of all Internet scams, and, as ex- 
pected, we have not. The key question is how representa- 
tive our sample is; without knowing the full set of scams 
(a very challenging measurement task), we cannot gauge 
the representativeness of the scams we find. Character- 
izing a large sample, however, still provides substantial 
insight into the infrastructure used to host scams. And 
it is further encouraging that many of the most exten- 
sive scams in the newsgroup feed are also found in ours. 
Moving forward, we plan to incorporate other sources of 
spam to expand our feed and further improve representativeness. 


Scam category                 | % of scams 
Uncategorized                 | 29.57% 
Information Technology        | 16.67% 
Dynamic Content               | 11.52% 
Business and Economy          | 6.23% 
Shopping                      | 4.30% 
Financial Data and Services   | 3.61% 
Illegal or Questionable       | 2.15% 
Adult                         | 1.80% 
Message Boards and Clubs      | 1.80% 
Web Hosting                   | 1.63% 


Table 2: Top ten scam categories. 


5 Analysis 


We analyze Internet scam infrastructure using scams 
identified from a large one-week trace of spam mes- 
sages. We start by summarizing the characteristics of 
our trace and the scams we identify. We then evaluate 
to what extent scams use multiple hosts as distributed 
infrastructure; using multiple hosts can help scams be 
more resilient to defenses. Next we examine how hosts 
are shared across scams as an indication of infrastructure 
reuse. We then characterize the lifetime and availability 
of scams. Scammers have an incentive to use host infras- 
tructure that provides longer lifetimes and higher avail- 
ability; at the same time, network and system administra- 
tors may actively filter or take down scams, particularly 
malicious ones. Lastly, we examine the network and geo- 
graphic locations of scams; again, scammers can benefit 
from using stable hosts that provide high availability and 
good network connectivity. 

Furthermore, since spam relay hosts are an integral as- 
pect of Internet scams, where appropriate in our analyses 
we compare and contrast characteristics of spam relays 
and scam hosts. 


5.1 Summary results 


We collected the spam from our feed for a one-week pe- 
riod from November 28, 2006 to December 4, 2006. For 
every URL extracted from spam messages, we probed 
the host specified by the URL for a full week (inde- 
pendent of whether the host responded or not) starting 
from the moment we received the spam. As a result, the 
prober monitored some hosts for a week beyond the receipt 
of the last spam email, up until December 11. Table 1 
summarizes the resulting spamscatter trace. Starting 
with over 1 million spam messages, we extracted 
36,390 unique URLs. Using image shingling, we identified 
2,334 scams hosted on 7,029 machines. Spam is 
very redundant in advertising scams: on average, 100 
spam messages with embedded URLs lead to only seven 
unique scams. 

Figure 5: Number of IP addresses and virtual domains per 
scam. 

What kinds of scams do we observe in our trace? We 
use a commercial Web content filtering product to de- 
termine the prevalence of different kinds of scams. For 
every URL in our trace, we use the Web content filter to 
categorize the page downloaded from the URL. We then 
assign that category to the scams referenced by the URL. 

Table 2 shows the ten most-prevalent scam categories. 
Note that we were not able to categorize all of the scams. 
We did not obtain access to the Web content filter until a 
few weeks after taking our traces, and 30% of the scams 
had URLs that timed out in DNS by that time (“Uncate- 
gorized” in the table). Further, 12% of the scams did not 
categorize due to the presence of dynamic content. The 
remaining 58% of scams fell into over 60 categories. Of 
these, the most prevalent scam category was “Information 
Technology,” which, judging from the screenshots of 
the scam sites, includes click affiliates, survey and free 
merchandise offers, and some merchandise for sale (e.g., 
hair loss, software). Just over 2% of the scams were labeled 
as malicious sites (e.g., containing malware). 


5.2 Distributed infrastructure 


We start by evaluating to what extent scams use multi- 
ple hosts as distributed infrastructure. Scams might use 
multiple hosts for fault-tolerance, for resilience in antici- 
pation of administrative takedown or blacklisting, for ge- 
ographic distribution, or even for load balancing. Also, 
reports of large-scale botnets are increasingly common, 
and botnets could provide a large-scale infrastructure for 
hosting scams; do we see evidence of botnets being used 
as a scalable platform for scam hosting? 


Scam category      | # of domains | # of IPs 
Watches            | 3029         | 3 
Pharmacy           | 695          | 4 
Watches            | 110          | 3 
Pharmacy           | 106          | 1 
Software           | 99           | 3 
Male Enhancement   | 94           | 2 
Phishing           | 91           | 14 
Viagra             | 90           | 1 
Watches            | 81           | 1 
Software           | 80           | 45 


Figure 6: The ten largest virtual-hosted scams and the 
number of IP addresses hosting the scams. 


We count multiple scam hosting from two perspec- 
tives, the number of virtual hosts used by a scam and 
the number of unique IP addresses used by those virtual 
hosts. Overall, the scams from our trace are typically 
hosted on a single IP address with one domain name. 
Of the 2,334 scams, 2,195 (94%) were hosted on a sin- 
gle IP address and 1,960 (84%) were hosted on a sin- 
gle domain name. Only a small fraction of scams use 
multiple hosting. Figure 5 shows the tails of the distributions 
of the number of virtual hosts and IP addresses 
used by the scams in our trace, and Figure 6 lists the top 
ten scams with the largest number of domains and IP addresses. 
Roughly 10% of the scams use three or more 
virtual domains, and 1% use 15 or more. The top scams 
use hundreds of virtual domains, with one scam using 
over 3,000. Of the 6% of scams hosted on multiple IP 
addresses, only a few used more than ten, with one scam 
using 45. The relatively prevalent use of virtual hosts 
suggests that scammers are likely concerned about URL 
blacklisting and use distinct virtual hosts in URLs sent in 
different spam messages to defeat such blacklists. 

The scams in our trace do not use hosting infrastruc- 
ture distributed across the network extensively. Most 
scams are hosted on a single IP address, providing a po- 
tentially convenient single point for network-based in- 
terdiction either via IP blacklisting or network filtering. 
Assuming that scammers adapt to defenses to remain ef- 
fective, such filtering does not appear to be applied ex- 
tensively. Scam serving workloads are apparently low 
enough that a single host can satisfy offered load suffi- 
ciently to reap the benefits of the scam. Finally, if scams 
do use botnets as hosting infrastructure, then they are not 
used to scale a single scam. A scammer could poten- 
tially use a botnet to host multiple different scams, host- 
ing each scam on a separate distinct bot, but our method- 
ology would not identify this case. 

Those few scams hosted on multiple IP addresses, 
however, are highly distributed. Scams with multiple 
IP addresses were most commonly distributed outside of 
the same /24 prefix. Of the 139 distributed scams, all 
the hosts in 86% of the scams were located entirely on 
distinct /24 networks. Moreover, 64% of the distributed 
scams had host IP addresses that were all in entirely 
different ASes. As an example, one distributed scam 
was a phishing attack targeting a bank. The phishing 
Web pages were identical across 14 hosts, all in different 
/24 networks. The attack employed 91 distinct domain 
names. The domain names followed the same naming 
convention using a handful of common keywords followed 
by a set of numbers, suggesting the hosts were all 
involved in the distributed attack. The fully distributed 
nature of these scams suggests that scammers were concerned 
about resilience to defenses such as blacklisting. 

Figure 7: The number of scams found on a server IP 
address. 
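
The check for whether all hosts of a scam sit on distinct /24 networks can be expressed in a few lines. This is a sketch using Python's standard ipaddress module; the function names and the sample addresses (drawn from documentation ranges) are illustrative and not taken from the trace.

```python
import ipaddress


def prefixes_24(addresses):
    """Map each IPv4 address string to its enclosing /24 network."""
    return [ipaddress.ip_network(f"{addr}/24", strict=False) for addr in addresses]


def all_distinct_24(addresses):
    """True when no two hosts of a scam share a /24 prefix."""
    nets = prefixes_24(addresses)
    return len(set(nets)) == len(nets)
```

For example, `all_distinct_24(["192.0.2.10", "198.51.100.7"])` holds, while two addresses such as 192.0.2.10 and 192.0.2.200 fall in the same /24 and fail the check. The AS-level comparison would need an external IP-to-AS mapping, which is outside this sketch.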


5.3 Shared infrastructure 


While we found that most scams are hosted on a single 
machine, a related question is whether these individual 
machines in turn host multiple scams, thereby sharing 
infrastructure across them. For each hosting IP address 
in our trace, we counted the number of unique scams 
hosted on that IP address at any time in the trace. Fig- 
ure 7 shows these results as a logscale histogram. Shared 
infrastructure is rather prevalent: although 1,450 scams 
(62%) were hosted on their own machines, the remaining 
38% of scams were hosted on machines hosting at least 
one other scam. Ten servers hosted ten or more scams, 
and the top three machines hosted 22, 18, and 15 differ- 
ent scams. This sharing of infrastructure suggests that 
scammers frequently either run multiple different scams 
on hosts that they control, or that hosts are made avail- 
able (sold, rented, bartered) to multiple scammers. 
Figure 8: Overlap time for scam pairs on a server. 


Host type    | Classification | % of hosts recognized 
Spam relay   | Open proxy     | 72.3% 
Scam host    | Open proxy     | 2.06% 
             | Spam host      | 14.9% 


Table 3: Blacklist classification of spam relays and scam 
hosts. 


5.3.1 Sharing over time 


We further examined these shared servers to determine 
if they host different scams sequentially or if, in fact, 
servers are used concurrently for different scams. For 
each pair of scams hosted on the same IP address, we 
compared their active times and durations with each 
other. When they overlapped, we calculated the duration 
of overlap. We found that scams sharing hosts shared 
them at the same time: 96% of all pairs of scams over- 
lapped with each other when they remained active. Fig- 
ure 8 shows the distribution of time for which scams 
overlapped. Over 50% of pairs of scams overlapped 
for at least 125 hours. Further calculating the ratio of 
time that scams sharing hosts were active, we found that 
overlapped scams did not necessarily start and end at the 
same time: only 10% of scam pairs fully overlapped each 
other. 
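
The pairwise overlap computation can be sketched as an interval intersection. In this example each scam's activity on a host is represented as a (start, end) pair in hours; the overlap ratio is defined here as overlap divided by the total span covered by the two scams, which is an assumption since the paper does not give the exact formula.

```python
def overlap_hours(a, b):
    """Hours during which two (start, end) activity intervals coincide."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start)


def overlap_ratio(a, b):
    """Overlap divided by the span from the earliest start to the latest end."""
    span = max(a[1], b[1]) - min(a[0], b[0])
    if span == 0:
        return 1.0  # both intervals are the same single instant
    return overlap_hours(a, b) / span
```

Under this definition a pair fully overlaps when `overlap_ratio` equals 1.0, that is, when both scams start and end at the same time.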


5.3.2 Sharing between scam hosts and spam relays 


More broadly, how often do the same machines serve as 
both spam relays and scam hosts? Hosts used 
for both spam and scams suggest, for instance, that ei- 
ther the spammer and the scammer are the same party, 
or that a third party controls the infrastructure and makes 
it available for use by different clients. We can only es- 
timate the extent to which hosts play both roles, but we 
estimate it in two ways. First, we determine the IP addresses 
of all of the hosts that send spam into our feed. 
We then compare those addresses with the IP addresses 
of the scam hosts. Based upon this comparison, we find 
only a small amount of overlap (9.7%) between the scam 
hosts and spam relays in our trace. 

Scam hosts could, of course, serve as spam relays that 
do not happen to send spam to our feed. For a more 
global perspective, we identify whether the spam and 
scam hosts we observe in our trace are blacklisted on 
well-known Internet blacklists. When the prober sees an 
IP address for the first time (either from a host sending 
spam or from a scam host), it performs a blacklist query 
on that IP address using the DNSBLLookup Perl mod- 
ule [16]. 
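
For readers unfamiliar with DNS-based blacklists, the query convention is to reverse the octets of the IPv4 address, append the blacklist's zone, and look up the resulting name; an answer indicates the address is listed. The sketch below shows this in Python rather than with the Perl module used by the prober, and the zone name `dnsbl.example.org` is a placeholder, not a real blacklist.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the DNS name to query, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone="dnsbl.example.org"):
    """Return True if the blacklist zone has an A record for this address.

    Performs a live DNS lookup; the zone must be a real DNSBL for the
    answer to be meaningful (the default here is only a placeholder).
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

Real blacklists typically encode the listing reason (such as open proxy versus direct spam source) in the returned 127.0.0.x address, which this sketch does not interpret.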

Table 3 shows the percentage of blacklisted spam re- 
lays and scam hosts. This perspective identifies a larger 
percentage (17%) of scam hosts as also sending spam 
than we found by comparing scam hosts and open relays 
within our spam feed, but the percentage is still small 
overall. The blacklists are quite effective, though, at clas- 
sifying the hosts that send spam to our feed: 78% of those 
hosts are blacklisted. The query identifies most of the 
spam hosts as open spam relays — servers that forward 
mail and mask the identity of the true sender — whereas 
most blacklisted scam hosts are identified as just send- 
ing spam directly. These results suggest that when scam 
hosts are also used to send spam, they are rarely used as 
an open spam service. 


5.4 Lifetime 


Next we examine how long scams remain active and, in 
the next section, how stable they are while active. The 
lifetime of a scam is a balance of competing factors. 
Scammers have an incentive to use hosting infrastructure 
that provides longer lifetimes and higher availability to 
increase their rate of return. On the other hand, for exam- 
ple, numerous community and commercial services pro- 
vide feeds and products to help network administrators 
identify, filter or take down some scam sites, particularly 
phishing scams [1, 6, 22, 25]. 

We define the lifetime of a scam as the time between 
the first and last successful timestamp for a probe opera- 
tion during the two-week measurement period, indepen- 
dent of whether any probes failed in between (we look 
at the effect of failed probe attempts on availability be- 
low). We use two types of probes to examine scam host 
lifetime from different perspectives (Section 4). Periodic 
ping probes measure host network lifetime, and periodic 
HTTP requests measure scam server lifetime. Recall that 
we probe all hosts for a week after they appear in our 
spam feed — and no longer — to remove any bias to- 
wards hosts that appear early in the measurement study. 

Figure 9: Lifetimes of individual scam hosts and Web
servers, as well as overall lifetimes of scams across multiple
hosts.


For comparison, we also calculate the lifetimes of entire 
scams. For scams that use multiple hosts, their lifetimes 
start when the first host appears in our trace and end with 
the lifetime of the last host to respond. As a result, scam 
lifetimes can exceed a week. 
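
The two lifetime definitions above can be sketched in a few lines of Python. The record layout (timestamps in hours, a boolean success flag per probe, and a first-seen time per host) is an assumption made for illustration; it is not the format of the original probe logs.

```python
def host_lifetime(probes):
    """Hours between the first and last successful probe of one host.

    `probes` is an iterable of (timestamp_hours, succeeded) pairs.
    Failed probes in between do not shorten the lifetime.
    Returns None if no probe ever succeeded.
    """
    ok = [t for t, succeeded in probes if succeeded]
    if not ok:
        return None
    return max(ok) - min(ok)


def scam_lifetime(hosts):
    """Lifetime of a scam that may span several hosts.

    `hosts` is an iterable of (first_seen_hours, probes) pairs. The scam
    starts when its first host appears in the trace and ends with the last
    successful probe of any of its hosts.
    """
    starts, ends = [], []
    for first_seen, probes in hosts:
        starts.append(first_seen)
        ok = [t for t, succeeded in probes if succeeded]
        if ok:
            ends.append(max(ok))
    if not starts or not ends:
        return None
    return max(ends) - min(starts)


single = host_lifetime([(0, True), (10, False), (50, True), (60, False)])
multi = scam_lifetime([
    (0, [(1, True), (100, True)]),     # first host, probed for a week
    (90, [(91, True), (250, True)]),   # second host appears later
])
```

Because each host is probed for only one week, a scam lifetime above 168 hours in this sketch can arise only when hosts appear at different times, which is the effect the text describes.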

How long are scams active? Figure 9 shows the dis- 
tributions of scam lifetime based upon these probes for 
the scams in our trace. For ping probes, we show the 
distribution of just those scam hosts that responded to 
pings (67% of all scam hosts). Scam hosts had long net- 
work lifetimes. Over 50% of hosts responded to pings for 
nearly the entire week that we probed them, and fewer 
than 10% of hosts responded to pings for less than 80 
hours. Given how close the distributions are, scam Web 
servers had only slightly shorter lifetimes overall. These 
results suggest that scam hosts are taken down soon after 
scam servers. 

Comparing the distribution of scam lifetimes to the 
others, we see that scams benefit from using multiple 
hosts. The 50% of scams whose lifetimes exceed a week 
indicate that the lifetimes of the individual scam hosts 
do not entirely overlap each other. Indeed, individual 
hosts for some scams appeared throughout the week of 
our measurement study, and the overall scam lifetime ap- 
proached the two weeks. 


5.4.1 Lifetime by category 


A substantial amount of community and commercial ef- 
fort goes into identifying malicious sites, such as phish- 
ing scams, and placing those sites on URL or DNS/IP 
blacklists. Thus, we would expect that the hosting in- 
frastructure for clearly malicious scams would be more 
transient than for other scams. To test this hypothesis, 
we used the categorization of scams to create a group
of malicious scams that include the “Illegal or Questionable”
and “Phishing” categories labeled by the Web content
filter (32 scams). For comparison, we also broke out
another group of more innocuous shopping scams that
include the “Shopping”, “Information Technology”, and
“Auction” categories (701 scams).

Figure 10: Scam lifetime distributions for malicious and
shopping scams.

We examined the lifetimes and prevalence on black- 
lists of these scams. Figure 10 shows the lifetime distri- 
butions of the malicious and shopping groups of scams, 
and includes the distribution of all scams from Figure 9 
for reference. The malicious scams have a noticeably 
shorter lifetime than the entire population, and the shop- 
ping scams have a slightly longer lifetime. Over 40% 
of the malicious scams persist for less than 120 hours, 
whereas the lifetime for the same percentage of shopping 
scams was 180 hours and the median for all scams was 
155 hours. These results are consistent with malicious 
scam sites being identified and taken down faster than 
other scam sites, although we cannot verify the causality. 

As further evidence, we also examined the prevalence 
of malicious scams on the DNS blacklists we use in Sec- 
tion 5.3.2, and compare it to the blacklisting prevalence 
of all scams and the shopping scams. Over 28% of the 
malicious scams were blacklisted, roughly twice as often 
as the shopping scams (12% blacklisted) and all scams 
(15%). Again, these results are consistent with the life- 
times of malicious scams — being blacklisted twice as 
frequently could directly result in shorter scam lifetimes. 


5.4.2 Spam campaign lifetime 


A related aspect of scam lifetime is the “spam campaigns”
used to advertise scams and attract clients. We
captured 319,700 spam emails with links in our trace,
resulting in 2,334 scams; on average, then, each scam
was advertised by 137 spam emails. We use these repeated
spam emails to determine the lifetime of the spam
campaign for a scam by measuring the time between
the first and last spam email messages advertising that
scam. Figure 11 shows the distribution of the spam campaign
lifetimes. Compared to the lifetime of scam sites,
most spam campaigns are relatively short. Over 50%
of the campaigns last less than 12 hours, over 90% last
less than 48 hours, and 99% last less than three days.
Roughly speaking, the lifecycle of a typical scam starts
with a short spam campaign lasting half of a day while
the scam site remains up for at least a week.

Figure 11: The duration of a spam campaign.

The relative lifetimes of spam campaigns and scam 
hosts again reflect the different needs of the two ser- 
vices. Compared with scam hosts, spam relays need to 
be active for much shorter periods of time to accomplish 
their goals. Spammers need only a window of time to 
distribute spam globally; once sent, spam relays are no 
longer needed for that particular scam. Scam hosts, in 
contrast, need to be responsive and available for longer 
periods of time to net potential clients. Put another way, 
spam is blanket advertising that requires no interaction 
with users to deliver, whereas scam hosting is a service 
that fundamentally depends upon user interaction to be 
successful. Scam hosts therefore benefit more from
stable infrastructure that remains useful and available for
much longer periods of time.


5.5 Stability 


A profitable scam requires stable infrastructure to serve 
potential customers at any time, and for as long as the 
scam is active. To gauge the stability of scam hosting 
infrastructure, we probed each scam host periodically for 
a week to measure its availability. When downloading
pages from the hosts, we also used p0f to fingerprint host
operating systems and link connectivity.

We computed scam availability as the number of successful
Web page downloads divided by the total number
of download attempts within the overall lifetime of the
scam; if a scam lasted for only three days, we computed
availability only during those days. Scams had excellent
availability: over 90% of scams had an availability of
99% or higher. Of the remaining scams, most had availabilities
of 98% or higher. As fingerprinted by p0f, more scams
ran on Unix or server appliances (43%) than Windows
systems (30%), and all of them had reported good link
connectivity. These results indicate that scam hosting is
quite reliable within the lifetime of a scam.

Figure 12: IP addresses, binned by /24 prefix, for spam-sending
relays and scam host servers.
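
The availability metric can be sketched as follows. The input layout, a list of (timestamp, succeeded) download attempts, is assumed for illustration, and the scam's lifetime is taken here to be the span between its first and last successful download.

```python
def availability(downloads):
    """Fraction of successful page downloads within the scam's lifetime.

    `downloads` is a list of (timestamp, succeeded) attempts. Attempts
    before the first success or after the last success fall outside the
    lifetime and are not counted. Returns None if nothing ever succeeded.
    """
    ok_times = [t for t, succeeded in downloads if succeeded]
    if not ok_times:
        return None
    first, last = min(ok_times), max(ok_times)
    in_window = [succeeded for t, succeeded in downloads if first <= t <= last]
    return sum(in_window) / len(in_window)


attempts = [(0, False), (1, True), (2, False), (3, True), (4, True), (5, False)]
result = availability(attempts)
```

Restricting the denominator to the lifetime window is what keeps a scam that was taken down after three days from being penalized for the remaining probes.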


5.6 Scam location 


We next examine both the network and geographic loca- 
tions of scam hosts. For comparison, we also examine 
the locations of the spam relays that sent the spam in our 
trace. Comparing them highlights the extent to which the
different requirements of the two services are reflected in
where around the world and in the network they are found.


5.6.1 Network location 


The network locations of spam relays and scam hosts are 
more consistent. Figure 12 shows the cumulative distri- 
bution of IP addresses for spam relays and scam hosts 
in our trace. Consistent with a similar analysis of spam 
relays in [27], the distributions are highly non-uniform. 
The IP addresses of most spam relays and scam hosts fall 
into the same two ranges, 58.* to 91.* and 200.* to 222.*.
However, within those two address ranges hosts for the 
two services have different concentrations. The majority 
of spam relays (over 60%) fall into the first address range 
and are distributed somewhat evenly except for a gap between
70.* and 80.*. Roughly half of the scam hosts
also fall into the first address range, but most of those
fall into the 64.* to 72.* subrange and relatively few in
the second half of the range. Similarly, scams are more
uniformly distributed within the second address range as
well.


Scam host country     % of all servers

United States         57.40%
China                  7.23%
Canada                 3.70%
Great Britain          3.07%
France                 3.06%
Germany                2.52%
Russia                 1.80%
South Korea            1.77%
Japan                  1.60%
Taiwan                 1.53%
Other                 16.32%

Table 4: Countries of scam hosts.


Spam relay country    % of all relays

United States         14.50%
France                 7.06%
Spain                  6.75%
China                  6.65%
Poland                 5.68%
India                  5.42%
Germany                5.00%
South Korea            4.67%
Italy                  4.44%
Brazil                 3.86%
Other                 30.97%

Table 5: Countries of spam relays.
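
The /24 binning behind the cumulative distribution in Figure 12 can be approximated with the Python standard library. This is a sketch under the assumption that each host is represented by a dotted-quad IPv4 string; the function name is illustrative.

```python
import ipaddress
from collections import Counter


def prefix24_cdf(ips):
    """Bin IPv4 addresses by /24 prefix and return the cumulative distribution.

    Returns a list of (prefix, cumulative_fraction) pairs ordered by
    numeric address, so the result can be plotted across the address space.
    """
    counts = Counter(
        ipaddress.ip_network(f"{ip}/24", strict=False).network_address
        for ip in ips
    )
    total = sum(counts.values())
    cdf, running = [], 0
    for prefix in sorted(counts):  # IPv4Address objects sort numerically
        running += counts[prefix]
        cdf.append((str(prefix), running / total))
    return cdf


cdf = prefix24_cdf(["64.1.2.3", "64.1.2.200", "200.5.5.5", "58.0.0.1"])
```

Steep segments of such a curve correspond to dense address ranges, which is how concentrations like the 64.* to 72.* subrange show up.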


5.6.2 Geographic location 


How do these variations in network address concentra- 
tions map into geographic locations? The effectiveness 
of scams could relate to (at least perceived) geographic 
location. As one anecdote, online pharmaceutical ven- 
dors utilized hosting servers inside the United States to 
imply to their customers that they were providing a law- 
ful service [24]. 

Using Digital Element’s NetAcuity tool [10], we 
mapped the IP addresses of scam hosts to latitude and 
longitude coordinates. Using these coordinates, we then 
identified the country in which the host was geographi- 
cally located. Table 4 shows the top ten countries con- 
taining scam hosts in our trace. Interestingly, the Ne- 
tAcuity service reported that nearly 60% of the scam 
hosts are located in the United States. Overall, 14% were 
located in Western Europe and 13% in Asia. For comparison,
Table 5 shows the top ten countries containing spam
relays. The geographic distributions for spam relays are
quite different from those for scam hosts. Only 14% of spam relays
are located in the United States, whereas 28% are located 
in Western Europe and 16% in Asia. We also found the 
top ASes for scam hosts and senders, but found no dis- 
cernible pattern and omit the results for brevity. 

The strong bias of locating scam hosts in the United 
States suggests that geographic location is more impor- 
tant to scammers than spammers. There are a number 
of possible reasons for this bias. One is the issue of 
perceived enhanced credibility by scammers mentioned 
above. Another relates to the difference in requirements 
for the two types of services. As discussed in Sec- 
tion 5.4.2, spam relays can take advantage of hosts with 
much shorter lifetimes than scam hosts. As a result, spam 
relays are perhaps more naturally suited to being hosted 
on compromised machines such as botnets; the compro- 
mised machine need only be under control of the spam- 
mer long enough to launch the spam campaign. Scam 
hosts benefit more from stability, and hosts and networks 
within the United States can provide this stability. 


6 Conclusion 


This paper does not study spam itself, nor the infrastruc- 
ture used to deliver spam, but rather focuses on the scam 
infrastructure that is nourished by spam. We demonstrate 
the spamscatter technique for identifying scam infras- 
tructure and how to use approximate image comparison 
to cluster servers according to individual scams — side- 
stepping the extensive content and networking camou- 
flaging used by spammers. 

From a week-long trace of a large real-time spam feed 
(roughly 150,000 per day), we used the spamscatter tech- 
nique to identify and analyze over 2,000 distinct scams 
hosted across more than 7,000 distinct servers. We found 
that, although large numbers of hosts are used to ad- 
vertise Internet scams using spam campaigns, individual 
scams themselves are typically hosted on only one ma- 
chine. Further, individual machines are commonly used 
to host multiple scams, and occasionally serve as spam 
relays as well. This practice provides a potentially con- 
venient single point for network-based interdiction either 
via IP blacklisting or network filtering. 

The lifecycle of a typical scam starts with a short spam 
campaign lasting half of a day while the scam site re- 
mains up for at least a week. The relative lifetimes of 
spam campaigns and scam hosts reflect the different re- 
quirements of the two underground services. Spam is 
blanket advertising that requires no interaction with users 
to deliver, whereas scam hosting is a service that funda- 
mentally depends upon user interaction to be successful. 
Finally, mapping the geographic locations of scam hosts,
we found that they have a strong bias to being located
in the United States. The strong bias suggests that ge- 
ographic location is more important to scammers than 
spammers, perhaps due to the stability of hosts and net- 
works within the U.S. 
Abstract
E-mail has become indispensable in today’s networked 
society. However, the huge and ever-growing volume 
of spam has become a serious threat to this important 
communication medium. It not only affects e-mail re- 
cipients, but also causes a significant overload to mail 
servers which handle the e-mail transmission. 

We perform an extensive analysis of IP addresses and 
IP aggregates given by network-aware clusters in order 
to investigate properties that can distinguish the bulk of 
the legitimate mail and spam. Our analysis indicates that 
the bulk of the legitimate mail comes from long-lived IP 
addresses. We also find that the bulk of the spam comes 
from network clusters that are relatively long-lived. Our 
analysis suggests that network-aware clusters may pro- 
vide a good aggregation scheme for exploiting the his- 
tory and structure of IP addresses. 

We then consider the implications of this analysis for 
prioritizing legitimate mail. We focus on the situation
when a mail server is overloaded, and the goal is to maximize
the legitimate mail that it accepts. We demonstrate
that the history and the structure of the IP addresses can 
reduce the adverse impact of mail server overload, by in- 
creasing the number of legitimate e-mails accepted by a 
factor of 3. 


1 Introduction 


E-mail has emerged as an indispensable and ubiquitous 
means of communication today. Unfortunately, the ever- 
growing volume of spam diminishes the efficiency of e- 
mail, and requires both mail server and human resources 
to handle. 

Great effort has focused on reducing the amount of 
spam that the end-users receive. Most Internet Service 
Providers (ISPs) operate various types of spam filters [1, 
4, 5, 13] to identify and remove spam e-mails before they 
are received by the end-user. E-mail software on end-hosts
adds an additional layer of filtering to remove this
unwanted traffic, based on the typical e-mail patterns of
the end-user.

Much less attention has been paid to how the large 
volume of spam impacts the mail infrastructure within 
an ISP, which has to receive, filter and deliver them ap- 
propriately. Spammers have a strong incentive to send 
large volumes of spam — the more spam they send, the 
more likely it is that some of it can evade the spam fil- 
ters deployed by the ISPs. It is easy for the spammer 
to achieve this — by sending spam using large botnets, 
spammers can easily generate far more messages than 
even the largest mail servers can receive. In such con- 
ditions, it is critical to understand how the mail server 
infrastructure can be made to prioritize legitimate mail, 
processing it preferentially over spam. 

In this context, the requirements for differentiating be- 
tween spam and non-spam are slightly different from 
regular spam-filtering. The primary requirement for reg- 
ular spam-filtering is to be conservative in discarding 
spam, and for this, computational cost is not usually a 
consideration. However, when the mail server must pri- 
oritize the processing of legitimate mail, it has to use a 
computationally-efficient technique to do so. In addition, 
in this situation, even an imperfect distinction criterion 
would be useful, as long as a significant fraction of the 
legitimate mail gets classified correctly. 

In this paper, we explore the potential of using the 
historical behaviour of IP addresses to predict whether 
an incoming email is likely to be legitimate or spam. 
Using IP addresses for classification is computationally 
substantially more efficient than any content-based tech- 
niques. IP address information can also be collected eas- 
ily and is more difficult for a spammer to obfuscate. Our 
measurement studies show that IP address information 
provides a stable discriminator between legitimate mail 
and spam. We find that good mail servers send mostly 
legitimate mail and are persistent for significant periods 
of time. We also find that the bulk of spam comes from 
IP prefixes that send mostly spam and are also persistent.
With these two findings, we can use the properties
of both legitimate mail and spam together, rather than us- 
ing the properties of only legitimate mail or only spam, 
in order to prioritize legitimate mail when needed. 

We show that these measurements are valuable in an 
application where legitimate mail must be prioritized. 
We focus on the situation when mail servers are over- 
loaded, i.e., they receive far more mail than they can pro- 
cess, even though the legitimate mail received is a tiny 
fraction of the total received. Since mail typically gets 
dropped at random when the server is overloaded, and 
spam can be generated at will, the spammer has an in- 
centive to overload the server. Indeed, the optimal strat- 
egy for the spammer is to increase the load on the mail 
infrastructure to a point where the most spam will be ac- 
cepted by the server; this kind of behaviour has been ob- 
served on the mail servers of large ISPs. In this paper, we 
show an application of our measurement study to design 
techniques based on the reputations of IP addresses and 
their aggregates and demonstrate the benefits to the mail 
server overload problem. 

The contributions of this paper are two-fold. We first 
perform an extensive measurement study in order to un- 
derstand some IP-based properties of legitimate mail and 
spam. We then perform a simulation study to evaluate 
how we can use these properties to prioritize legitimate 
mail when the mail server is overloaded. 

Our main results are the following: 


• We find that a significant fraction of legitimate mail
comes from IP addresses that last for a long time, 
even though a very significant fraction of spam 
comes from IP addresses that are ephemeral. This 
suggests that the history of “good” IP addresses, 
that is, IP addresses that send mostly legitimate 
mail, could be used for prioritizing mail in spam 
mitigation. 


• We explore network-aware clusters as a candidate
aggregation scheme to exploit structure in IP ad- 
dresses. Our results suggest that IP addresses 
responsible for the bulk of the spam are well- 
clustered, and that the clusters responsible for the 
bulk of the spam are persistent. This suggests that 
network-aware clusters may be good candidates to 
assign reputations to unknown IP addresses. 


• Based on our measurement results, we develop a
simple reputation scheme that can prioritize IP ad- 
dresses when the server is overloaded. Our simula- 
tions show that when the server receives many more 
connection requests than it can process, our policy 
gives a factor of 3 improvement in the number of 
legitimate mails accepted. 


We note that the server overload problem is just one 
application that illustrates how IP information could be 
used for prioritizing email. This information could be 
used to prioritize e-mail at additional points of the mail 
server infrastructure as well. However, the kind of struc- 
tural information that is reflected in the IP addresses may 
not always be a perfect discriminator between spammers 
and senders of legitimate mail, and this is, indeed, re- 
flected in the measurements. Such structural IP informa- 
tion could, therefore, be used in combination with other 
techniques in a general-purpose spam mitigation system, 
and this information is likely to be useful by itself only 
when an aggressive and computationally-efficient tech- 
nique is needed. 

The remainder of the paper is structured as follows. 
We present our analysis of characteristics of IP addresses 
and network-aware clusters that distinguish between le- 
gitimate mail and spam in Sections 2 and 3 respectively. 
We present and evaluate our solution for protecting mail 
servers under overload in Section 4. We review related 
work in Section 5 and conclude in Section 6. 


2 Analysis of IP-Address Characteristics 


In this section, we explore the extent to which IP-based 
identification can be used to distinguish spammers from 
senders of legitimate e-mail based on differences in pat- 
terns of behaviour. 


2.1 Data 


Our data consists of traces from the mail server of a large 
company serving one of its corporate locations with ap- 
proximately 700 mailboxes, taken over a period of 166 
days from January to June 2006. The location runs a 
Postfix mail server with extensive logging that records
the following: (a) every attempted SMTP connection,
with its IP address and time stamp, (b) whether the connection
was rejected, along with a reason for rejection,
(c) if the connection was accepted, the results of the
mail server’s additional local spam-filtering tests, and
if accepted for delivery, the results of running SpamAssassin.

Fig. 1(a) shows a daily summary of the data for six 
months. It shows four quantities for each day: (a) the 
number of SMTP connection requests made (including 
those that are denied via blacklists), (b) the number of 
e-mails received by the mail server, (c) the number of 
e-mails that were sent to SpamAssassin, and (d) the 
number of e-mails deemed legitimate by SpamAssassin. 
The relative sizes of these four quantities on every day 
illustrate the scale of the problem: spam is 20 times 
larger than the legitimate mail received. (In our data set, 
there were 1.4 million legitimate messages and 27 mil- 
lion spam messages in total.) Such a sharp imbalance 
indicates the potential of a significant role for applications
like maximizing legitimate mail accepted when the
server is overloaded: if there is a way to prioritize legiti- 
mate mail, the server could handle it much more quickly, 
because the volume of legitimate mail is tiny in compar- 
ison to spam. 

In the following analysis, every message that is con- 
sidered legitimate by SpamAssassin is counted as a le- 
gitimate message; every message that is considered spam 
by SpamAssassin, the mail server’s local spam-filtering 
tests, or through denial by a blacklist is counted as spam. 


2.2 Analysis of IP Addresses 


We first explore the behaviour of individual IP addresses 
that send legitimate mail and spam, with the goal of un- 
covering any significant differences in their behavioral 
patterns. 

Our analysis focuses on the IP spam-ratio of an IP address,
which we define to be the fraction of mail sent
by the IP address that is spam. This is a simple, intu- 
itive metric that captures the spamming behaviour of an 
IP address: a low spam-ratio indicates that the IP address 
sends mostly legitimate mail; a high spam-ratio indicates 
that the IP address sends mostly spam. Our goal is to 
see whether the historical communication behaviour of 
IP addresses categorized by their spam-ratios can differ- 
entiate between IP addresses of legitimate senders and 
spammers, for spam mitigation. 
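
A minimal sketch of the IP spam-ratio computation follows. It assumes each log record has already been reduced to an (ip, is_spam) pair using the labeling rule from Section 2.1; the function name and record layout are illustrative.

```python
from collections import defaultdict


def spam_ratios(messages):
    """Map each IP address to the fraction of its mail that is spam.

    `messages` is an iterable of (ip, is_spam) pairs, one per message.
    A ratio near 0.0 means mostly legitimate mail, near 1.0 mostly spam.
    """
    total = defaultdict(int)
    spam = defaultdict(int)
    for ip, is_spam in messages:
        total[ip] += 1
        if is_spam:
            spam[ip] += 1
    return {ip: spam[ip] / total[ip] for ip in total}


ratios = spam_ratios([
    ("198.51.100.7", True),
    ("198.51.100.7", False),
    ("198.51.100.7", True),
    ("198.51.100.7", True),
    ("203.0.113.9", False),
])
```

Restricting the input to one day's messages yields the daily spam-ratio used in Section 2.2.1, and feeding the whole trace yields the lifetime spam-ratio used later.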

As discussed earlier, the differentiation between the 
legitimate senders and spammers need not be perfect; 
there are benefits to having even a partial differentiation, 
especially with a simple, computationally inexpensive 
feature. For example, in the server overload problem, 
when all the mail cannot be accepted, a partial separa- 
tion would still help to increase the amount of legitimate 
mail that is received. 

In the IP-based analysis, we will address the following 
questions: 


• Distribution by IP Spam Ratio: What is the distribution
of IP addresses by their spam-ratio, and what
fraction of legitimate mail and spam is contributed 
by IP addresses with different spam-ratios? 


• Persistence: Are IP addresses with low/high spam-ratios
present across long time periods? If they are,
do such IP addresses contribute to a significant frac- 
tion of the legitimate mail/spam? 


• Temporal Spam-Ratio Stability: Do many of the IP
addresses that appear to be good on average fluctu- 
ate between having very low and very high spam- 
ratios? 


The answers to these three questions, taken together, 
give us an idea of the benefit we could derive in using the 
history of IP address behaviour in spam mitigation. We 
show in Sec. 2.2.1, that most IP addresses have a spam- 
ratio of 0% or 100%, and also that a significant amount 
of the legitimate mail comes from IP addresses whose 
spam-ratio exceeds zero. In Sec. 2.2.2, we show that 
a very significant fraction of the legitimate mail comes 
from IP addresses that persist for a long time, but only 
a tiny fraction of the spam comes from IP addresses that 
persist for a long time. In Sec. 2.2.3, we show that most 
IP addresses have a very high temporal ratio-stability — 
they do not fluctuate between exhibiting a very low or 
very high daily spam-ratio over time. 

Together, these three observations suggest that iden- 
tifying IP addresses with low spam ratios that regularly 
send legitimate mail could be useful in spam mitigation 
and prioritizing legitimate mail. In the rest of this sec- 
tion, we present the analysis that leads to these observa- 
tions. For concreteness, we focus on how the analysis 
can help spam mitigation in the server overload problem. 





[Figure 1(a): Data characteristics. Number of messages per day
plotted against time in days (0 to 160), with one series each for
SMTP handshake received, mail received, SpamAssassin applied,
and legitimate mail.]

[Figure 1(b): CDFs of IP spam-ratios for many days: each line is a
CDF for a different day. Horizontal axis: IP spam-ratio, 0% to 100%.]

Figure 1: 1(a): Daily summary of the data set over 6
months. 1(b): CDFs of IP spam-ratios for many different
days.

2.2.1 Distribution by IP Spam-Ratio


In this section, we explore how the IP addresses and their 
associated mail volumes are distributed as a function of 
the IP spam-ratios. We focus here on the spam-ratio 
computed over a short time period in order to understand 
the behaviour of IP addresses without being affected by 
their possible fluctuations in time. Effectively, this anal- 
ysis shows the limits of the differentiation that could be 
achieved by using IP spam-ratio, even assuming that IP 
spam-ratio could be predicted for a given IP address over 
short periods of time. In this section, we focus on day- 
long intervals, in order to take into account possible time- 
of-day variations. We refer to the IP spam-ratio com- 
puted over a day-long interval as the daily spam-ratio. 

Intuitively, we expect that most IP addresses either 
send mostly legitimate mail, or mostly spam, and that 
most of the legitimate mail and spam comes from these 
IP addresses. If this hypothesis holds, then for spam mit- 
igation, it will be sufficient if we can identify the IP ad- 
dresses as senders of legitimate mail or spammers. To 
test this hypothesis, we analyze the following two empir- 
ical distributions: (a) the distribution of IP addresses as 
a function of the spam-ratios, and (b) the distribution of 
legitimate mail/spam as a function of their respective IP 
addresses’ spam-ratio. 

We first analyze the distribution of IP addresses by 
their daily spam-ratios in Fig. 1(b). For each day, 
it shows the empirical cumulative distribution func- 
tion (CDF) of the daily spam-ratios of individual IP ad- 
dresses active on that day. Fig. 1(b) shows this daily CDF 
for a large number of randomly selected days across the 
observation period. 
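
For one day's data, the empirical CDF plotted in Fig. 1(b) can be computed as sketched below. The input is assumed to be the list of daily spam-ratios of the IP addresses active on that day, expressed as fractions between 0 and 1.

```python
def empirical_cdf(values):
    """Return the empirical CDF as a list of (x, F(x)) points.

    F(x) is the fraction of values less than or equal to x. Repeated
    values are collapsed into a single point carrying the final fraction.
    """
    xs = sorted(values)
    n = len(xs)
    points = []
    for i, x in enumerate(xs, start=1):
        if points and points[-1][0] == x:
            points[-1] = (x, i / n)  # same x again: raise its fraction
        else:
            points.append((x, i / n))
    return points


day_cdf = empirical_cdf([1.0, 0.0, 1.0, 0.5])
```

Repeating this per day and overlaying the curves gives a plot of the same kind as Fig. 1(b); a large jump at 1.0 corresponds to the addresses that send only spam.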


Result 1. Distribution of IP addresses: (i) Most IP
addresses send either mostly spam or mostly legitimate
mail. (ii) Fewer than 1-2% of the active IP addresses
have a spam-ratio between 1% and 99%, i.e., there are
very few IP addresses that send a non-trivial fraction of
both spam and legitimate mail. (iii) Further, the vast
majority (nearly 90%) of IP addresses on any given day
generate almost exclusively spam, and have spam-ratios
between 99% and 100%.


The above results indicate that identifying IP ad- 
dresses with low or high spam-ratios could identify most 
of the legitimate senders and spammers. In addition, for 
some applications (e.g., the mail server overload prob- 
lem), it would be valuable to identify the IP addresses 
that send the bulk of the spam or the bulk of the legiti- 
mate mail, in terms of mail volume. To do so, we next 
explore how the daily legitimate mail or spam volumes 
are distributed as a function of the IP spam-ratios, and 
the resulting implications. 

Let $I_k$ denote the set of all IP addresses that have a
spam-ratio of at most $k$. Fig. 2 examines how the volume
of legitimate mail and spam sent by the set $I_k$ depends on
the spam-ratio $k$. Specifically, let $L_i(k)$ and $S_i(k)$ be the
fractions of the total daily legitimate mail and spam that
come from all IPs in the set $I_k$ on day $i$. Fig. 2(a) plots
$L_i(k)$ averaged over all the days, along with confidence
intervals. Fig. 2(b) shows the corresponding distribution
for the spam volume $S_i(k)$.
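
Written out, the two quantities plotted in Fig. 2 can be stated as follows. The per-address counts $\ell_i(a)$ and $s_i(a)$ (legitimate and spam messages sent by address $a$ on day $i$), the daily ratio $r_i(a)$, and the superscript that makes the day explicit on the set of addresses with spam-ratio at most $k$ are notation assumed here for illustration; they do not appear in the text.

```latex
r_i(a) = \frac{s_i(a)}{\ell_i(a) + s_i(a)}, \qquad
I_k^{(i)} = \{\, a : r_i(a) \le k \,\}

L_i(k) = \frac{\sum_{a \in I_k^{(i)}} \ell_i(a)}{\sum_{a} \ell_i(a)}, \qquad
S_i(k) = \frac{\sum_{a \in I_k^{(i)}} s_i(a)}{\sum_{a} s_i(a)}
```

Both fractions are non-decreasing in $k$ and equal 1 at $k = 100\%$, since every active address then belongs to the set.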


Result 2. Distribution of legitimate mail volume: 
Fig. 2(a) shows that the bulk of the legitimate mail 
(nearly 70% on average) comes from IP addresses with a 
very low spam-ratio (k < 5%). However, a modest frac- 
tion (over 7% on average) also comes from IP addresses 
with a high spam-ratio (k > 80%). 

Figure 2: Legitimate mail and spam contributions as a
function of IP spam-ratio. 


Result 3. Distribution of spam volume: Fig. 2(b) in- 
dicates that almost all (over 99% on average) of the 
spam sent every day comes from IP addresses with an 
extremely high spam-ratio (when k > 95%). Indeed, the 
contribution of the IP addresses with lower spam-ratios 
(k < 80%) is a tiny fraction of the total. 


We observe that the distribution of legitimate mail volume
as a function of the spam-ratio k is more diffused
than the distribution of spam volume. There are two possible
explanations for such behaviour of the legitimate
senders. First, spam-filtering software tends to be conservative,
allowing some spam to be marked as legitimate
mail. Second, a lot of legitimate mail tends to come
from large mail servers that cannot do perfect outgoing 
spam-filtering. These mail servers may, therefore, have 
a slightly higher IP spam-ratio, and this would cause the 
distribution of legitimate mail to be more diffused across 
the spam-ratio. 

Together, the above results suggest that the IP spam-ratio
may be a useful discriminating feature for spam
mitigation. As an example, assume that we have a classification
function that accepted (or prioritized) all IP
addresses with a spam-ratio of at most k and rejected 
all IP addresses with a higher spam-ratio. Then, if we 
set k = 95%, we could accept (or prioritize) nearly all 
the legitimate mail, and no more than 1% of the spam. 
However, such a classification function requires perfect 
knowledge of every IP address’s daily spam-ratio every 
single day, and in reality, this knowledge may not be 
available. 

Instead, our approach is to identify properties that hold 
over longer periods of time and are useful for predicting 
the current behaviour of an IP address from its long-term 
history; we then incorporate these properties into 
classification functions. The effectiveness of such 
history-based classification functions for spam mitigation 
depends on the extent to which IP addresses are long-lived, 
how much of the legitimate email or spam is contributed 
by the long-lived IP addresses, and to what 
extent the spam-ratio of an IP address varies over time. 
Sec. 2.2.2 and Sec. 2.2.3 explore these questions. 

For the following analysis, we focus on the spam-ratio 
of each individual IP address, computed over the entire 
data set, since we are interested in its behaviour over its 
lifetime. We refer to this as the lifetime spam-ratio of the 
IP address. We show the presence of two properties in 
this analysis: (i) a significant fraction of legitimate mail 
comes from good IP addresses that last for a long time 
(persistence), and (ii) IP addresses that are good on aver- 
age tend to have a low spam-ratio each time they appear 
(temporal stability). These two properties directly influ- 
ence how effective it would be to use historical informa- 
tion for determining the likelihood of spam coming from 
an individual IP address. 


2.2.2 Persistence 


Due to the community structure inherent in non-spam 
communication patterns, it seems reasonable that most of 
the legitimate mail will originate from IP addresses that 
appear and re-appear. Previous studies have also indi- 
cated that most of the spam comes from IP addresses that 


are extremely short-lived. These suggest the existence 
of a potentially significant difference in the behaviour of 
senders of legitimate mail and spammers with respect to 
persistence. We next quantify the extent to which these 
hypotheses hold, by examining the persistence of indi- 
vidual IP addresses. 

Our methodology for understanding the persistence 
behavior of IP addresses is as follows: we consider the 
set of all IP addresses with a low lifetime spam-ratio and 
examine how much legitimate mail they send, as well as 
how much of the legitimate mail is sent by IP addresses 
that are present for a long time. Such an understanding 
can indicate the potential of using a whitelist-based ap- 
proach for prioritizing legitimate mail. If, for instance, 
the bulk of the legitimate mail comes from IP addresses 
that last for a long time, we could use this property to 
prioritize legitimate mail from long-lasting IP addresses 
with low spam-ratios. 


[Figure 3(a): Number of k-good IP addresses present for x or more days. x-axis: number of days; y-axis: number of IP addresses.] 

[Figure 3(b): Fraction of legitimate mail sent by k-good IP addresses present for x or more days. x-axis: number of days; curves for k = 1, 5, 10, 20, 30.] 

Figure 3: Persistence of k-good IP addresses. 


For this analysis, we use the following two definitions. 


Definition 1. A k-good IP address is an IP address 
whose lifetime spam-ratio is at most k. A k-good set 
is the set of all k-good IP addresses. Thus, a 20-good set 
is the set of all IP addresses whose lifetime spam-ratio is 
no more than 20%. 


We compute (a) the number of k-good IP addresses 
present for at least x distinct days, and (b) the fraction of 
legitimate mail contributed by k-good IP addresses that 
are present in at least x distinct days.¹ Fig. 3(a) shows 
the number of IP addresses that appear in at least x dis- 
tinct days, for several different values of k. 

Fig. 3(b) shows the fraction of the total legitimate mail 
that originates from IP addresses that are in the k-good 
set and appear in at least x days, for each threshold k. 
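
The two quantities plotted in Fig. 3 can be sketched in a few lines of code. This is only a minimal illustration, not the analysis pipeline used for the measurements: the record layout `(day, ip, n_legit, n_spam)` and the function names are assumptions made for the example, and k is expressed as a fraction rather than a percentage.

```python
from collections import defaultdict

def lifetime_stats(records):
    """Aggregate per-IP lifetime totals.

    records: iterable of (day, ip, n_legit, n_spam) tuples (hypothetical layout).
    Returns {ip: (distinct_days_present, total_legit, total_spam)}.
    """
    days = defaultdict(set)
    legit = defaultdict(int)
    spam = defaultdict(int)
    for day, ip, n_legit, n_spam in records:
        days[ip].add(day)
        legit[ip] += n_legit
        spam[ip] += n_spam
    return {ip: (len(days[ip]), legit[ip], spam[ip]) for ip in days}

def k_good_persistence(records, k, x):
    """Return (number of k-good IPs present on >= x distinct days,
    fraction of the total legitimate mail those IPs sent)."""
    stats = lifetime_stats(records)
    total_legit = sum(l for _, l, _ in stats.values())
    count = 0
    legit_from_persistent = 0
    for n_days, l, s in stats.values():
        if l + s == 0:
            continue  # no mail recorded, so the spam-ratio is undefined
        lifetime_ratio = s / (l + s)
        if lifetime_ratio <= k and n_days >= x:
            count += 1
            legit_from_persistent += l
    frac = legit_from_persistent / total_legit if total_legit else 0.0
    return count, frac

# Toy data: "a" is a persistent good sender, "b" a short-lived good
# sender, and "c" a spammer.
records = [
    (1, "a", 10, 0), (2, "a", 10, 0), (3, "a", 10, 1),
    (1, "b", 5, 0),
    (1, "c", 0, 20), (2, "c", 1, 19),
]
count, frac = k_good_persistence(records, k=0.2, x=2)
```

In the toy data, only "a" is both 20-good and present on at least two days, yet it accounts for most of the legitimate mail, which mirrors the qualitative shape of Fig. 3(b).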

Most of the IP addresses in a k-good set are not present 
very long, and the number of IP addresses falls quickly, 
especially in the first few days. However, their contribu- 
tion to the legitimate mail drops much more slowly as x 
increases. The result is that the few longer-lived IPs 
contribute most of the legitimate mail from a k-good set. 
For example, only 5% of all IP addresses in the 20-good 
set appear on at least 10 distinct days, but they contribute 
almost 87% of all legitimate mail from the 20-good set. If 
the k-good set contributes a significant fraction of the 
legitimate mail, then the few longer-lived IP addresses 
also contribute significantly to the total legitimate mail. 
For instance, IP addresses in the 20-good set contribute 
63.5% of the total legitimate mail received. Only 2.1% 
of those IP addresses are present for at least 30 days, but 
they contribute over 50% of the total legitimate mail 
received. 


Result 4. Distribution of legitimate mail from persistent 
k-good IPs: Fig. 3 indicates that (i) IP addresses 
with low lifetime spam-ratios (small k) tend to contribute 
a major proportion of the total legitimate email, and (ii) 
only a small fraction of the IP addresses with a low 
lifetime spam-ratio appear over many days, but 
they contribute a significant fraction of the legitimate 
mail. 


The graphs also reveal another trend: the longer an IP 
address lasts, the more stable is its contribution to the le- 
gitimate mail. For example, 0.09% of the IP addresses 
in the 20-good set are present for at least 60 days, but 
they contribute to over 40% of the total legitimate mail 
received. From this, we can infer that there were an addi- 
tional 1.2% of IP addresses in the 20-good set that were 
present for 30-59 days, but they only contributed to 10% 
of the total legitimate mail received. 


¹Our analysis considers persistence of IP addresses only in our data 
set, i.e., it considers whether the IP address has sent mail for x days 
to our mail server. These IP addresses may have sent mail to other 
mail servers on more days, and combining data across multiple different 
mail servers may give a better picture of the stability of IP addresses 
sending mail. Nevertheless, in this work, we focus on the persistence in 
one data set, as it highlights behavioural differences due to community 
structure present within a single vantage point. 


Fig. 4 presents a similar analysis of persistence for IP 
addresses with a high lifetime spam-ratio. Like the k- 
good IP addresses and k-good sets, we define k-bad IP 
addresses and k-bad sets. 


Definition 2. A k-bad IP address is an IP address that 
has a lifetime spam-ratio of at least k. A k-bad set is the 
set of all k-bad IP addresses. 
Figure 4: Persistence of k-bad IP addresses. 


Fig. 4(a) presents the number of IP addresses in the k-bad 
set that are present in at least x days, and Fig. 4(b) 
presents the fraction of the total spam sent by IP addresses 
in the k-bad set that are present in at least x days. 


Result 5. Distribution of spam from persistent k-bad 
IPs: Fig. 4 indicates that (i) IP addresses with high 
lifetime spam ratios (large k) tend to contribute almost 
all of the spam, (ii) most of these high spam-ratio IPs are 
only present for a short time (this is consistent with the 
finding in [19]) and account for a large proportion of 
the overall spam, and (iii) the small fraction of these IPs 
that do last several days contribute a non-trivial fraction 
of the overall spam; however, a much larger fraction of 
spam comes from IP addresses that are not present for 
very long. As in the case of the k-good IP addresses, the 
spam contribution from the k-bad IP addresses tends to 
get more stable with time. 


So, for instance, we can see from Fig. 4 that only 1.5% 
of the IP addresses in the 80-bad set appear in at least 
10 distinct days, and these contribute to 35.4% of the 
volume of spam from the 80-bad set, and 34% of the total 
spam. The difference is more pronounced for 100-bad IP 
addresses: 2% of the 100-bad IP addresses appear for 10 
or more distinct days, and contribute to 25% of the total 
spam volume. 

The results of this section have implications for designing 
spam filters, especially for applications where the 
goal is to prioritize legitimate mail rather than discard 
spam. While spamming IP addresses that are present suf- 
ficiently long can be blacklisted, the scope of a purely 
blacklisting approach is limited. On the other hand, a 
very significant fraction of the legitimate mail can be pri- 
oritized by using the history of the senders of legitimate 
mail. 


2.2.3 Temporal Stability 


Next, we seek to understand whether IP addresses in the 
k-good set change their daily spam-ratio dramatically 
over the course of their lifetime. The question we want 
to answer is: of the IP addresses that appear in a k-good 
set (for small values of k), what fraction of them have 
ever had “high” daily spam-ratios, and how often do they 
have “high” spam-ratios? Thus, we want to understand 
the temporal stability of the spam-ratio of IP addresses 
in k-good sets. In this section, we focus on k-good IP 
addresses; the results for the k-bad IP addresses are sim- 
ilar. 
Figure 5: Temporal stability of IP addresses in k-good 
sets, shown by CCDF of frequency-fraction excess. 


We compute the following metric: for each IP address 
in a k-good set, we count how often its daily spam-ratio 
exceeds k (and normalize this count by the number 
of days it appears). We define this quantity to be 
the frequency-fraction excess of the IP address, for the 
k-good set. We plot the complementary cdf (CCDF) of 
the frequency-fraction excess of all IP addresses in the 
k-good set.² Intuitively, the distribution of the frequency-fraction 
excess is a measure of how many IP addresses in 
the k-good set exceed k, and how often they do so. 
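
As an illustration of the metric just defined, the following sketch computes the frequency-fraction excess of one IP address and evaluates the CCDF at a single point. The input layout (a list of per-day `(n_legit, n_spam)` counts) and the function names are assumptions made for this example; k is given as a fraction.

```python
def frequency_fraction_excess(daily_counts, k):
    """Fraction of the days on which the IP appeared where its daily
    spam-ratio exceeded k.

    daily_counts: list of (n_legit, n_spam), one entry per day
    (hypothetical layout). Days with no mail are ignored.
    """
    days = [(l, s) for l, s in daily_counts if l + s > 0]
    if not days:
        return 0.0
    exceed = sum(1 for l, s in days if s / (l + s) > k)
    return exceed / len(days)

def ccdf_at(values, x):
    """Fraction of values that are at least x (one point of the CCDF)."""
    return sum(1 for v in values if v >= x) / len(values)

# An IP whose lifetime spam-ratio is 5/50 = 10% (so it is 20-good),
# but which exceeds 20% on one of its four days.
ip_days = [(10, 0), (9, 1), (6, 4), (20, 0)]
excess = frequency_fraction_excess(ip_days, k=0.2)
```

Applying `frequency_fraction_excess` to every address in a k-good set and then evaluating `ccdf_at` over a range of x values gives a curve of the kind shown in Fig. 5.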

Fig. 5 shows the CCDF of the frequency-fraction ex- 
cess for several k-good sets. It shows that the majority 
of the IP addresses in each k-good set have a frequency- 
fraction excess of 0, and that 95% of the k-good IP ad- 
dresses have a frequency-fraction excess of at most 0.1. 

We explain the implications of Fig. 5 to the temporal 
stability of the spam-ratio of IP addresses with an exam- 
ple. We focus on the k-good set for k = 20: this is the 
set of IP addresses whose lifetime spam-ratio is bounded 
by 20%. We note that the frequency-fraction excess is 0 
for 95% of the 20-good IP addresses. This implies that 
95% of IP addresses in this k-good set do not send more 
than 20% spam any day, i.e., every time they appear, they 
have a daily spam-ratio of at most 20%. We also note that 
fewer than 1% of the IP addresses in this k-good set have 
a frequency-fraction excess larger than 0.2. 

Thus, for many k-good sets with small k-values, only 
a few IP addresses have a significant frequency-fraction 
excess, i.e., very few IP addresses in those sets exceed 
the value k often. Since they would need to exceed k 
often to change their spamming behaviour significantly, 
it follows that most IP addresses in the k-good set do not 
change their spamming behaviour significantly. 

In addition, the frequency-fraction excess is perhaps 
too strict a measure, since it is affected even if k is ex- 
ceeded slightly. We also compute a similar measure that 
increases only when k is exceeded by 5%. No more than 
0.01% of IP addresses in the k-good set exceed k by 5%, 
for any k < 30%. Since we are especially interested in 
the temporal stability of IP addresses that appear often, 
we compute also the frequency-fraction excess distribu- 
tion for IP addresses that appear for 10, 20, 40 and 60 
days. In each case, almost no IP address exceeds k by 
more than 5%, for any k < 30%. 

We summarize this discussion in the following result. 


Result 6. Temporal stability of k-good IPs: Fig. 5 
shows that most IP addresses in k-good sets (for low k, 
e.g., k < 30%) do not exceed k often; i.e., most k-good 
IP addresses have low spam-ratios (at most k) nearly 
every day. 


With the above result, we can analyze the behaviour of 
k-good sets of IP addresses, constructed over their entire 
lifetime, and their behaviour in shorter time intervals. 


²That is, we plot the fraction of IP addresses in the k-good set whose 
frequency-fraction excess is at least x. The y-axis of the plot is 
restricted for readability. 
The analysis of these three properties of IP addresses 
indicates that a significant fraction of the legitimate 
mail comes from IP addresses that persistently appear 
in the traffic. These IP addresses tend to exhibit stable 
behaviour: they do not fluctuate significantly between 
sending spam and legitimate mail. These results lend 
weight to our hypothesis that spam mitigation efforts can 
benefit by preferentially allocating resources to the sta- 
ble and persistent senders of legitimate mail. However, 
there is still a substantial portion of the mail that cannot 
be accounted for through only IP address-based analysis. 
In the next section, we focus on how to account for this 
mail. 


3 Analysis of Cluster Characteristics 


So far, we have analyzed whether the historical be- 
haviour of individual IP addresses can be used to dis- 
tinguish between senders of legitimate mail and spam- 
mers. However, if we only consider the history of indi- 
vidual IP addresses, we cannot determine whether a new, 
previously unseen, IP address is likely to be a spammer 
or a sender of legitimate mail. If there are many such 
IP addresses, then, in order to be useful, any prioritiza- 
tion scheme would need to assign these new IP addresses 
appropriate reputations as well. Indeed, in Sec. 2.2.2, 
we found that most IP addresses sending mail are short- 
lived and that such short-lived IPs account for a signifi- 
cant proportion of both legitimate mail and spam. Any 
prioritization scheme would thus need to be able to find 
reputations for these IP addresses as well. 

To address this issue, we explore whether coarser aggregations 
of IP addresses exhibit more persistence and 
afford more effective discriminatory power for spam mitigation. 
If such aggregations of IP addresses can be 
found, the reputation of an unseen IP address could be 
derived from the historical reputation of the aggregation 
it belongs to. 

We focus on IP aggregations given by network-aware 
clusters of IP addresses [15]. Network-aware clusters are 
sets of unique network IP prefixes collected from a wide 
set of BGP routing table snapshots. In this paper, an IP 
address belongs to a network-aware cluster if the longest 
prefix match of the IP address matches the prefix asso- 
ciated with the cluster. In the reputation mechanisms 
we explore in Sec. 4, an IP address derives the reputa- 
tion of the network-aware cluster that it belongs to. We 
use network-aware clustering because these clusters rep- 
resent IP addresses that are close in terms of network 
topology and do, with high probability, represent regions 
of the IP space that are under the same administrative 
control and share similar security and spam policies [15]. 
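
The mapping from an IP address to its network-aware cluster is a longest-prefix match, which can be sketched as follows. This is a minimal illustration using Python's standard `ipaddress` module; the prefix list is a made-up stand-in for prefixes extracted from BGP routing table snapshots, and the linear scan is chosen for clarity rather than speed (a trie would be the usual choice at scale).

```python
import ipaddress

def build_clusters(prefixes):
    """Parse prefix strings such as "10.0.0.0/8" into network objects."""
    return [ipaddress.ip_network(p) for p in prefixes]

def cluster_of(ip, clusters):
    """Return the longest prefix in `clusters` that contains `ip`,
    or None if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net in clusters:
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best.prefixlen:
                best = net
    return best

# Hypothetical prefixes, for illustration only.
clusters = build_clusters(["10.0.0.0/8", "10.1.0.0/16", "192.0.2.0/24"])
match = cluster_of("10.1.2.3", clusters)  # both 10/8 and 10.1/16 contain it
```

With such a mapping, a previously unseen address inherits whatever reputation has been accumulated for its cluster.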

In this section, we present measurements suggesting 
that network-aware clusters of IP addresses may provide 


a good basis for reputation-based classification of IP ad- 
dresses. We focus on the following questions: 


• Granularity: Does the mail originating from 
network-aware clusters consist of mostly spam or 
mostly legitimate mail, so that these clusters could 
be useful as a reputation-granting mechanism for IP 
addresses? 


• Persistence: Do individual network-aware clusters 
appear (i.e., do IP addresses belonging to the clusters 
appear) over long periods of time, so that 
network-aware clusters could potentially afford us 
a useful mechanism to distinguish between different 
kinds of ephemeral IP addresses? 


As in the IP-address case, we adopt the spam-ratio of 
a network-aware cluster as the discriminating feature 
of clusters and examine whether clusters with low/high 
spam-ratios are granular and persistent. 

Before examining these two properties in detail, we 
first summarize our analysis of the properties with respect 
to which clusters behave as IP addresses do. First, clusters 
turn out to be at least as temporally stable as IP addresses, 
and usually more so (cf. the IP address 
behaviour explored in Sec. 2.2.3), which is the expected 
behaviour. Second, the distribution of clusters by daily cluster 
spam-ratio is similar to the distribution of IP addresses 
by IP spam-ratio (cf. the IP address behaviour 
explored in Sec. 2.2.1). 


3.1 Cluster Granularity 


For network-aware clustering of IP addresses to be use- 
ful, the clusters need to be sufficiently homogeneous in 
terms of their legitimate mail/spam behavior so that the 
cluster information can be used to separate the bulk of le- 
gitimate mail from the bulk of spam. Recall that with the 
IP addresses, we analyzed the extent to which IP spam- 
ratios could be used to identify the IP addresses send- 
ing the bulk of legitimate mail and spam. Here, we an- 
alyze whether, instead of an IP’s individual spam-ratio, 
the spam-ratio of the parent cluster can be used for the 
same purpose. 

To do so, we need to understand how well the clus- 
ter spam-ratio approximates the IP spam-ratio. In our 
context, we focus on the following question: can we still 
distinguish between the IP addresses that send the bulk of 
the legitimate mail and the bulk of the spam? If we can, 
within a margin of error, it would suggest that cluster- 
level analysis is nearly as good as IP-level analysis. 

For the analysis here, we determine the spam-ratio 
of each cluster by analyzing the mail sent by all IP ad- 
dresses belonging to that cluster and assign to IP ad- 
dresses the spam-ratios of their respective clusters. In 
Figure 6: Penalty of using cluster-level analysis. 


the rest of this discussion, we will refer to legitimate 
mail/spam sent by IP addresses belonging to a cluster 
as the legitimate mail/spam sent by or coming from that 
cluster. As with the IP-based analysis, we examine how 
the volume of legitimate mail and spam from IP ad- 
dresses is distributed as a function of their cluster spam- 
ratios. To understand the additional error imposed by us- 
ing the cluster spam-ratio, we compare it with how those 
volumes are distributed as a function of the IP spam- 
ratio. 

Fig. 6(a) shows how the spam sent by IP addresses 
with a cluster or IP spam-ratio of at most k varies with 
k. Specifically, on day i, let CS_i(k) and IS_i(k) be the 
fraction of spam sent by the IP addresses with a cluster 
spam-ratio (and IP spam-ratio, respectively) of at most 
k. Fig. 6(a) plots CS_i(k) and IS_i(k) averaged over all 
the days in the data set, as a function of k, along with 
confidence intervals. 


Result 7. Distribution of spam with cluster and IP 
spam-ratios: Fig. 6(a) shows that almost all (over 95%) 
of the spam every day comes from IPs in clusters with a 
very high cluster spam-ratio (over 90%). A similar frac- 


tion (over 99% on average) of the spam every day comes 
from IP addresses with a very high IP spam-ratio (over 
90%). 


This suggests that spammers responsible for a high 
volume of the total spam may be closely correlated 
with the clusters that have a very high spam-ratio. The 
graph indicates that if we use a spam-ratio threshold of 
k < 90% for spam mitigation, then using the IP spam- 
ratio rather than the corresponding cluster spam-ratio as 
the discriminating feature would increase the amount of 
spam identified by less than 2%. This suggests that clus- 
ter spam-ratios are a good approximation to IP spam- 
ratios for identifying the bulk of the spam sent. 

We next consider how legitimate mail is distributed 
with the cluster spam-ratios and compare it with IP 
spam-ratios (Fig. 6(b)). We compute the following metric: 
let CL_i(k) and IL_i(k) be the fraction of legitimate 
mail sent by IPs with cluster and IP spam-ratios of at 
most k on day i. Fig. 6(b) plots CL_i(k) and IL_i(k) 
averaged over all the days in the data set as a function of k, 
along with confidence intervals. 
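
The four per-day quantities compared in Fig. 6 (the fraction of spam, and of legitimate mail, sent by IP addresses whose cluster or IP spam-ratio is at most k) can be sketched as below. The record layout, the `ip_to_cluster` mapping, and the choice to compute the cluster spam-ratio from that day's mail only are assumptions made for this example.

```python
from collections import defaultdict

def daily_fractions(day_records, ip_to_cluster, k):
    """Compute the cluster-level and IP-level fractions for one day.

    day_records: list of (ip, n_legit, n_spam) for a single day.
    ip_to_cluster: dict mapping each ip to a cluster identifier.
    Returns a dict with keys "CS", "IS", "CL", "IL".
    """
    c_legit = defaultdict(int)
    c_spam = defaultdict(int)
    for ip, l, s in day_records:
        c = ip_to_cluster[ip]
        c_legit[c] += l
        c_spam[c] += s

    total_legit = sum(l for _, l, _ in day_records)
    total_spam = sum(s for _, _, s in day_records)
    out = {"CS": 0.0, "IS": 0.0, "CL": 0.0, "IL": 0.0}
    for ip, l, s in day_records:
        if l + s == 0:
            continue  # the IP sent nothing, so it has no spam-ratio
        c = ip_to_cluster[ip]
        ip_ratio = s / (l + s)
        cluster_ratio = c_spam[c] / (c_legit[c] + c_spam[c])
        if ip_ratio <= k:
            out["IS"] += s
            out["IL"] += l
        if cluster_ratio <= k:
            out["CS"] += s
            out["CL"] += l
    for key in ("CS", "IS"):
        out[key] = out[key] / total_spam if total_spam else 0.0
    for key in ("CL", "IL"):
        out[key] = out[key] / total_legit if total_legit else 0.0
    return out

# Toy day: "a" is a good sender that shares cluster X with spammer "b".
day = [("a", 90, 10), ("b", 10, 90), ("c", 0, 100)]
clusters = {"a": "X", "b": "X", "c": "Y"}
low_k = daily_fractions(day, clusters, k=0.2)
```

In the toy data, the good sender "a" is captured by its own spam-ratio at k = 0.2 but not by its cluster's spam-ratio, which is the kind of penalty on legitimate mail that the comparison is meant to expose.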


Result 8. Distribution of legitimate mail with cluster 
and IP spam-ratios: Fig. 6(b) shows that a significant 
amount of legitimate mail is contributed by clusters with 
both low and high spam-ratios. A significant fraction of 
the legitimate mail (around 45% on average) comes from 
IP addresses with a low cluster spam-ratio (k < 20%). 
However, a much larger fraction of the legitimate mail 
(around 70%, on average) originates from IP addresses 
with a similarly low IP spam-ratio. 


The picture here, therefore, is much less promising: 
even when we consider spam-ratios as high as 30–40%, 
the cluster spam-ratios can only distinguish, on average, 
around 50% of the legitimate mail. By contrast, IP spam- 
ratios can distinguish as much as 70%. This suggests that 
IP addresses responsible for the bulk of legitimate mail 
are much less correlated with clusters of low spam-ratio. 

We can then make the following conclusion: suppose 
we use a classification function to accept or reject IP ad- 
dresses based on their cluster spam-ratio. What addi- 
tional penalty would we incur over a similar classifica- 
tion function that used the IP address’s own spam-ratio? 
Fig. 6(b) suggests that, if the threshold is set to 90% or 
higher, we incur very little penalty in both legitimate mail 
acceptance and spam. However, if the threshold is set to 
30–40%, we may incur as much as a 20% penalty in 
doing so. 

However, there are two additional ways in which such 
a classification function could be enhanced. First, as we 
have seen, the bulk of the legitimate mail does come from 
persistent k-good IP addresses. This suggests that we 
could potentially identify more legitimate mail by con- 
sidering the persistent k-good IP addresses in addition 
to cluster-level information. Second, for some applica- 
tions, the correlation between high cluster spam-ratios 
and the bulk of the spam may be sufficient to justify us- 
ing cluster-level analysis. For example, under the exist- 
ing distribution of spam and legitimate mail, even a high 
cluster spam-ratio threshold would be sufficient to reduce 
the total volume of the mail accepted by the mail server. 
This is exactly the situation in the server overload prob- 
lem and we see the effect in the simulations in Sec. 4. 


3.2 Persistence 


Next, we explore how persistent the network-aware clus- 
ters are, just as we did for the IP addresses. We define a 
cluster to be present on a day if at least one IP address 
that belongs to that cluster appears that day. We reported 
earlier that we found the clusters themselves to be at least 
as temporally stable as IP addresses, and usually more so. 
Our next goal is to examine how much of the total legiti- 
mate mail/spam the long-lived clusters contribute. 

As in Sec. 2.2.2, we will define k-good and k-bad clus- 
ters; to do that, we use the lifetime cluster spam-ratio: 
the ratio of the total spam sent by the cluster to the total 
mail sent by it over its lifetime. 


Definition 3. A k-good cluster is a cluster of IP ad- 
dresses whose lifetime cluster spam-ratio is at most k. 
The k-good cluster-set is the set of all k-good clusters. A 
k-bad cluster is a cluster of IP addresses whose lifetime 
cluster spam-ratio is at least k. The k-bad cluster-set is 
the set of all k-bad clusters. 


Fig. 7(a) examines the legitimate mail sent by k-good 
clusters for small values of k. We first note that the 
k-good clusters (even when k is as large as 30%) con- 
tribute less than 40% of the total legitimate mail; this 
is in contrast to, for instance, 20-good IP addresses that 
contributed to 63.5% of the total legitimate mail. How- 
ever, we note the contribution from long-lived clusters is 
far more than from long-lived individual IPs. The dif- 
ference from Fig. 3(b) is striking: e.g., k-good clusters 
present for 60 or more days contribute to nearly 99% 
of the legitimate mail from the k-good cluster set. So, 
any cluster accounting for a non-trivial volume of legit- 
imate mail is present for at least 60 days. Indeed, the 
legitimate mail sent by k-good clusters drops to 90% of 
k-good cluster-set’s total only when restricted to clusters 
present for 120 or more days; by contrast, for individual 
IP addresses, the legitimate mail contribution dropped to 
87% of the 20-good set’s total after just 10 days. 

Fig. 7(b) presents the same analysis for k-bad clusters. 
Again, there are noticeable differences from the k-bad IP 
addresses, and also from the k-good clusters. A much 
larger fraction of spam comes from long-lived clusters 
than from long-lived IPs in Fig. 4(b). For example, over 
Figure 7: Persistence of network-aware clusters. 


92% of the total spam is contributed by 90-bad clusters 
present for at least 20 days. This is in sharp contrast with 
the k-bad IP addresses, where only 20% of the total spam 
comes from IP addresses that last 20 or more days. We 
also note that the 90-bad cluster-set contributes to nearly 
95% of the total spam. Thus, in contrast to the legitimate 
mail sent by k-good cluster-sets, the bulk of the spam 
comes from the k-bad cluster-sets with high k. 


Result 9. Distribution of mail from persistent clus- 
ters: Fig. 7 shows that the clusters that are present for 
long periods with high cluster spam-ratios contribute 
the overwhelming fraction of the spam sent, while those 
present for long periods with low cluster spam-ratios 
contribute a smaller, though still significant, fraction of 
the legitimate mail sent. 


The above result suggests that network-aware cluster- 
ing can be used to address the problem of transience of IP 
addresses in developing history-based reputations of IP 
addresses: even if individual IP addresses are ephemeral, 
their (possibly collective) history would be useful in 
assigning reputations to other IP addresses originating 
from the same cluster. 


4 Spam Mitigation under Mail Server 
Overload 


In the previous sections, we have demonstrated that there 
are significant differences in the historical behaviour of 
IP addresses that send a lot of spam, and those that send 
very little. In this section, we consider how these differ- 
ences in behaviour could be exploited for spam mitiga- 
tion. 

Our measurements have shown that senders of le- 
gitimate mail demonstrate significant stability and per- 
sistence, while spammers do not. However, the bulk 
of the high volume spammers appear to be clustered 
well within many persistent network-aware clusters. To- 
gether, these suggest that we can design techniques based 
on the historical reputation of an IP address and the clus- 
ter to which it belongs. However, because mail rejec- 
tion mechanisms necessarily need to be conservative, we 
believe that such a reputation-based mechanism is pri- 
marily useful for prioritizing legitimate mail, rather than 
actively discarding all suspected spammers. 

As an application of these measurements, we now con- 
sider the mail-server overload problem described in the 
introduction. In this section, we demonstrate how the 
problem could be tackled with a reputation-based mech- 
anism that exploits these differences in behaviour. In 
Sec. 4.1, we explain the mail-server overload problem 
in more detail. In Sec. 4.2, we explain our approach, 
describing the mail server simulation and algorithms that 
we use, and in Sec. 4.3, we present an evaluation showing 
the performance improvement gained using these differ- 
ences in behaviour. 

We emphasize that this simulation study is intended 
to demonstrate the potential of using these behavioural 
differences in the legitimate mail and spam for prioritiz- 
ing exclusively by IP addresses. However, it is not in- 
tended to be comparable to content-based spam filtering. 
We also note that these differences in behaviour could be 
applied in other ways and at other points in the 
mail processing as well. The quantitative benefits that 
we achieve may be specific to our application and may 
be different in other applications. 


4.1 Server Overload Problem 


The problem we consider is the following: When the 
mail server receives more SMTP connections than it can 
process in a time interval, how can it selectively accept 
connections to maximize the acceptance of legitimate 
mail? That is, the mail server receives a sequence of 
connection requests from IP addresses every second, and 
each connection will send mail that is either legitimate or 


spam. Whether the IP address sends spam or legitimate 
mail in that connection is not known at the time of the 
request, but is known after mail is processed by the spam 
filter. The mail server has a finite capacity of the number 
of mails that can be processed in each time interval, and 
may choose the connections it accepts or rejects. The 
goal of the mail server is to selectively accept connec- 
tions in order to maximize the legitimate mail accepted. 

We note that spammers have strong incentive to cause 
mail servers to overload, and illustrate this with an exam- 
ple. Assume that a mail server can process 100 emails 
per second, that it will start dropping new incoming 
SMTP connections when its load reaches 100 emails per 
second, and that it crashes if the offered load reaches 200 
emails per second. Assume also that 20 legitimate emails 
are received per second. A spammer could increase the 
load of the mail server to 100% by sending 80 emails per 
second, all of which would be received by the mail server. 
Alternatively, the spammer could increase the load 
to 199% by sending 179 spam emails per second, and 
now nearly half the requests would not be served. If the 
mail server is unable to distinguish between the spam re- 
quests and the legitimate mail requests, it drops connec- 
tions at random, and the spammer will be able to suc- 
cessfully get through 89 spam emails per second to the 
mail server, as compared to the 80 in the previous case. 
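
The arithmetic behind this example can be checked with a short calculation. The sketch below assumes, as in the text, that an overloaded server drops connections uniformly at random, so that each class of mail is accepted in proportion to its share of the offered load; the function name is illustrative only.

```python
def accepted_per_second(capacity, legit_rate, spam_rate):
    """Expected (legit, spam) emails accepted per second when an
    overloaded server drops connections uniformly at random."""
    offered = legit_rate + spam_rate
    if offered <= capacity:
        return float(legit_rate), float(spam_rate)  # nothing is dropped
    share = capacity / offered  # probability that a given request is served
    return legit_rate * share, spam_rate * share

# Server at exactly 100% load: all 80 spam emails per second get through.
legit_a, spam_a = accepted_per_second(100, 20, 80)

# Server at 199% load: about half of all requests are dropped, yet more
# spam is accepted and about half of the legitimate mail is lost.
legit_b, spam_b = accepted_per_second(100, 20, 179)
```

Under these assumptions the second scenario yields roughly 89.9 spam emails per second, in line with the figure of 89 quoted above, while the accepted legitimate mail falls from 20 to about 10 emails per second.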

Thus, the optimal operation point of a spammer, as- 
suming that he has a large potential sending capacity, is 
not the maximum capacity of the mail server but the max- 
imum load before the mail server will crash. This obser- 
vation indicates that the approach of throwing more re- 
sources at the problem would only work if the mail server 
capacity is increased to exceed the largest botnet avail- 
able to the spammer. This is typically not economically 
feasible and a different approach is needed. 

The results in Sec. 2 and Sec. 3 suggest that there 
may be a history-based reputation function R that relates 
IP addresses to their likelihood of sending spam. 
Thus, for example, if R(i) is the probability that an IP 
address i sends legitimate mail, then maximizing the 
quantity Σ_i R(i) over the accepted connections would 
maximize the expected number of accepted legitimate mails. 
If the reputation function R were 
known, this problem would be similar to admission control 
and deadline scheduling; however, in our case, R is 
not known. 
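
To make the objective concrete, the following sketch shows the idealized case in which R is known and all connection requests for an interval are available at once. If every accepted connection consumes one unit of capacity, accepting the requests with the highest reputation maximizes the sum of R over the accepted set. The dictionary of reputations, the `default` value for unseen addresses, and the function name are all hypothetical; this is not the algorithm evaluated in the paper.

```python
import heapq

def select_connections(requests, reputation, capacity, default=0.0):
    """Accept up to `capacity` requests with the highest reputation.

    requests: list of IP strings requesting a connection in one interval.
    reputation: dict ip -> assumed probability of sending legitimate mail.
    default: reputation assigned to unseen IPs (an assumption).
    Returns (accepted_ips, expected_legitimate_count).
    """
    scored = [(reputation.get(ip, default), ip) for ip in requests]
    top = heapq.nlargest(capacity, scored, key=lambda pair: pair[0])
    accepted = [ip for _, ip in top]
    expected_legit = sum(r for r, _ in top)
    return accepted, expected_legit

# Four requests compete for two slots.
reqs = ["a", "b", "c", "d"]
rep = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.5}
accepted, expected = select_connections(reqs, rep, capacity=2)
```

In practice R must be estimated from history and requests arrive online, which is why the problem differs from standard admission control.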

In this work, we choose one simple history-based rep- 
utation function and demonstrate that it performs well. 
We reiterate that our goal is not to explore the space 
of the reputation functions or to find the best reputation 
function. Rather, our goal is to demonstrate that they 
could potentially be used to increase the legitimate mail 
accepted when the mail-server is overloaded. In addition, 
our goal is to preferentially accept e-mails from certain 
IP addresses only when the mail servers are overloaded;
we would like to minimize the impact on mail servers
when they are not overloaded. A poor choice of R will 
then not impact the mail server under normal operation. 

The techniques and the reputation functions that we 
choose address concerns that are different from those 
addressed by standard IP-based classification techniques 
like blacklisting and greylisting, as neither blacklisting 
nor greylisting would directly solve the server overload 
problem. Blacklisting has well-known issues: building 
a blacklist takes time and effort, most IP addresses that 
send spam are observed to be ephemeral, appearing very 
few times, and many of them are not even present in any 
single blacklist. 

While greylisting is an attractive short-term solution 
that has been observed to work quite well in practice, it 
is not robust to spammer evasion, since spammers could 
simply mimic the behaviour of a normal mail server.
Greylisting aims to optimize a different goal — its goal 
is to delay the mail in the hope that a spam signature 
is generated in the mean time, so that spam can be dis- 
tinguished from non-spam; however, delaying the mail 
does not reduce the overall server load, since the spam- 
mer can always return to send more mail, and comput- 
ing a content-based spam signature would continue to be 
as expensive. Indeed, greylisting gives spammers even 
more incentive to overload mail servers by re-trying af- 
ter a specified time period. 

Our techniques for the server overload problem pro- 
vide an additional layer of information when compared 
to blacklisting and greylisting. It may be possible to use 
the IP structure information to enhance greylisting, to 
decide, at finer granularities and with soft thresholding, 
which IP addresses to deny. 


4.2 Design and Algorithms 


Today, when mail servers experience overload, they drop 
connections greedily: the server accepts all connections 
until it is at maximum load, and then refuses all connec- 
tion requests until its load drops below the maximum. 
We aim to improve the performance under overload by 
using information in the structure of IP addresses, as sug- 
gested by the results in Sec. 2 and Sec. 3. At a high-level, 
our approach is to obtain a history of IP addresses and IP 
clusters, and use it to select the IP addresses that we pri- 
oritize under overload. To explore the potential benefits 
of this approach, we simulate the mail server operation 
and allow some additional functionality to handle over- 
load. 

To motivate our simulation, we describe briefly the 
way many mail servers in corporations and ISPs operate. 
First, the sender’s mail server or a mail relay tries to con- 
nect to the receiving mail server via TCP. The receiving 
mail server accepts the connection if capacity is avail-
able, and then the mail servers perform the SMTP hand-
shake and transfer the email. The receiving mail server 
stores the email to disk and adds it to the spam processing 
queue. For each e-mail on the queue, the receiving mail 
server then performs content-based spam filtering [3, 1] 
which is typically the most expensive part of email pro- 
cessing. After this, the spam emails are dropped or de- 
livered to a spam mailbox, and the good emails are de- 
livered to the inbox of the recipient. 

In our simulation we simplify the mail server model, 
while ensuring that it is still sufficiently rich to capture 
the problem that we explore. We believe that our model 
is sufficiently representative for a majority of mail server 
implementations used today; however, we acknowledge 
that there are mail server architectures in use which are 
not fully captured in our model. In the next section, we 
describe the simulation model in more detail. 


4.2.1 Mail Server Simulation 


We simulate mail-server operation in the following man- 
ner: 


• Phase 1: When the mail server receives an SMTP
connection request, it may decide whether or not to
accept the connection. If it decides to accept the
connection, the incoming mail takes t time units to
be transferred to the mail server. Thus, if a server
can accept k connection requests simultaneously, it
behaves like a k-parallel processor in this phase. We
model it this way because this phase covers the SMTP hand-
shake and transfer of mail, and therefore, it needs to
model state for each connection separately.


• Phase 2: Once the mail has been received, it is
added to a queue for spam filtering and delivery to
the receiving mailbox if any. At each time-step, the
mail server selects mails from this queue and pro-
cesses them; the number of mails chosen depends on
the mail server’s capacity and the cost of each in-
dividual mail. Here, since we model computation
cycles, a sequential processing model suffices. The 
mail server has a timeout: it discards any mail that 
has been in the queue for more than m time units. 
If the load has sufficient fluctuation, a large timeout 
would be useful, but we want to minimize timeout 
since email has the expectation of being timely. 


We assume that the cost of denying/dropping a request
is 0, the cost of processing the SMTP connection is an α
fraction of its total cost, and the cost of the remainder
is a 1 − α fraction of the total cost. We also allow Phase
1 of the mail server simulator to have an α fraction of the
server’s computational resources, and Phase 2 to have
the remainder. Since the content-based analysis is typ-
ically the most expensive part of processing a message,
we expect that α is likely to be small.

This two-phase simulation model allows for more flex- 
ibility in our policy design, since it opens the possibility 
of dropping emails which have already been received and 
are awaiting spam filtering without wasting too many re- 
sources. 
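
The two-phase model can be sketched as a small discrete-time simulator. This is an illustrative reading of the description above, not the simulator used in the evaluation: it uses the default first-come-first-served behaviour in both phases, and it abstracts the split of computational resources between the phases into two assumed parameters, `k` (simultaneous Phase-1 connections) and `per_step` (mails Phase 2 can filter per time step).

```python
from collections import deque


def simulate_greedy(arrivals, k, t, per_step, m):
    """Replay (time, sender) arrivals through a two-phase mail server.

    k: simultaneous Phase-1 connections, t: transfer time per connection,
    per_step: mails Phase 2 can filter per time step, m: queue timeout.
    Returns counts of rejected, timed-out, and delivered mails.
    """
    pending = deque(sorted(arrivals))
    transfers = deque()  # (finish_time, sender) for connections in Phase 1
    queue = deque()      # (enqueue_time, sender) waiting for Phase 2
    stats = {"rejected": 0, "timed_out": 0, "delivered": 0}
    now = 0
    while pending or transfers or queue:
        # Finished transfers move to the Phase-2 queue.
        while transfers and transfers[0][0] <= now:
            finish, sender = transfers.popleft()
            queue.append((finish, sender))
        # Phase 1: accept in arrival order while a connection slot is free.
        while pending and pending[0][0] <= now:
            _, sender = pending.popleft()
            if len(transfers) < k:
                transfers.append((now + t, sender))
            else:
                stats["rejected"] += 1
        # Phase 2: discard mails past the timeout, then process in FIFO order.
        while queue and now - queue[0][0] > m:
            queue.popleft()
            stats["timed_out"] += 1
        for _ in range(min(per_step, len(queue))):
            queue.popleft()
            stats["delivered"] += 1
        now += 1
    return stats


burst = [(0, "a"), (0, "b"), (0, "c")]
print(simulate_greedy(burst, k=2, t=1, per_step=1, m=60))
print(simulate_greedy(burst, k=3, t=1, per_step=1, m=0))
```

In the first run the third connection is refused in Phase 1; in the second run all three are accepted but two exceed the queue timeout, which illustrates the extra decision point that the second phase provides.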


4.2.2 Policies 


Next, we present the prioritization/drop policies that we 
implemented and evaluated on the mail server simulator. 
In this simulation model, the default mail-server action 
corresponds to the following: at each time-interval, the 
server accepts incoming requests in the order of arrival, 
as long as it is not overloaded. Once mail has been re- 
ceived, the server processes the first mail in the queue, 
and discards any mail that has exceeded its timeout. We 
refer to this as the greedy policy.³

The space of policy options within which a mail server is al-
lowed to operate determines the kinds of benefits it can
get. In this problem, one natural option for the mail
server is to decide immediately whether to accept or re- 
ject a connection request. However, such a policy may 
be quite sensitive to fluctuation in the workload received 
at the mail server. Another option may be to reject some 
e-mails after the SMTP connection has been accepted, 
but before any spam-filtering checks or content-based 
analysis (such as spam-filtering software) has been ap- 
plied. Note that content-based analysis typically is the 
most computationally expensive part of receiving mail. 
Thus, with this option, the mail server may do a small 
amount of work for some additional emails that eventu- 
ally get rejected, but is less affected by the fluctuation 
of mail arrival workload. We restrict the space of policy 
options to the time before any content-based analysis of 
the incoming mail is done. 

To solve the mail-server overload problem, we imple- 
ment the following policies at the two phases: 


• Phase-1 policy: The policy in Phase 1 is designed to
preferentially accept IP addresses with a good rep- 
utation when the server is near maximum load: as 
the server gets closer to overload, the policy only ac- 
cepts IP addresses with better and better reputations. 
The policy itself is more complex, since it needs to 
consider the expected legitimate mail workload, and 
yet not stay idle too long. We therefore leave exact 
details to the appendix. In addition, when the load
is below some percentage (we choose 75%) of the
total capacity, the server accepts all mail: this way,
it minimizes impact on normal operation of the mail
server.⁴

³To ensure that the current mail server policy is not unfairly mod-
elled under this simulation model, we evaluated greedy policies in an-
other simulation model, in which each connection took z time units to
process from start to end. The performance of the greedy policy was
similar, so we do not describe the model further.


• Phase-2 policy: The scheduling policy here is eas-
ier to design, since the queue has some knowledge
of what needs to be processed. Even a simple policy 
that greedily accepts the item with the highest rep- 
utation value will do well, as long as the reputation 
function is reasonably accurate. We use this greedy 
policy for Phase 2. 
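
The Phase-2 policy can be sketched as follows. This is a minimal illustration under stated assumptions: the reputation value is a spam-ratio, so a lower value is better; `phase2_select` is a hypothetical helper; and the default of 0.6 for senders without history is an arbitrary stand-in for a slightly bad reputation.

```python
def phase2_select(queue, reputation, now, budget, timeout, unknown=0.6):
    """Pick the mails to filter in this time step, best reputation first.

    queue:      list of (enqueue_time, sender_ip) awaiting content filtering
    reputation: dict mapping sender_ip -> spam-ratio (lower is better)
    budget:     number of mails the server can process in this step
    Returns (processed, remaining); mails past the timeout are discarded.
    """
    # Discard any mail that has waited longer than the timeout.
    live = [(ts, ip) for ts, ip in queue if now - ts <= timeout]
    # sorted() is stable, so equal reputations keep their arrival order.
    ranked = sorted(live, key=lambda item: reputation.get(item[1], unknown))
    return ranked[:budget], ranked[budget:]


queue = [(0, "spammy"), (1, "good"), (2, "unknown")]
rep = {"spammy": 0.95, "good": 0.02}
processed, remaining = phase2_select(queue, rep, now=3, budget=2, timeout=60)
print(processed, remaining)
```

Sorting the whole queue at every step is the simplest choice; a heap keyed on reputation would do the same job more efficiently for long queues.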


Our history-based reputation function R is simple: 
First, we find a list of persistent senders of legitimate 
mail from the same time period (we choose all senders 
that have appeared in at least 10 days), and for these IP 
addresses, we use their lifetime IP spam-ratio as their 
reputation value. For the remaining IP addresses, we use 
their cluster spam-ratio as their reputation value: for each 
week, we use the history of the preceding four weeks in 
computing the lifetime spam-ratio (defined over 4 weeks) 
for each cluster that sends mail.⁵ In this way, we com-
bine the results of the IP-based analysis and cluster-based
analysis in Sec. 2 in designing the reputation function. 
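
The reputation function described above can be sketched as a simple lookup. The data layout (`History`, `ip_history`, `cluster_of`, `cluster_history`) is hypothetical, the histories are assumed to be precomputed over the relevant window, and the value 0.6 for senders without any history is an assumed placeholder for a slightly bad reputation.

```python
from dataclasses import dataclass


@dataclass
class History:
    spam: int            # spam mails seen in the history window
    legit: int           # legitimate mails seen in the history window
    legit_days: int = 0  # days on which legitimate mail was seen (IPs only)

    @property
    def spam_ratio(self):
        total = self.spam + self.legit
        return self.spam / total if total else None


def reputation(ip, ip_history, cluster_of, cluster_history,
               min_days=10, unknown=0.6):
    """Return a spam-ratio reputation for `ip` (lower is better)."""
    h = ip_history.get(ip)
    # Persistent senders of legitimate mail use their own lifetime spam-ratio.
    if h is not None and h.legit_days >= min_days and h.spam_ratio is not None:
        return h.spam_ratio
    # Everyone else falls back to the spam-ratio of their cluster.
    c = cluster_history.get(cluster_of.get(ip))
    if c is not None and c.spam_ratio is not None:
        return c.spam_ratio
    # No IP or cluster history: assign a slightly bad default.
    return unknown


ips = {"192.0.2.1": History(spam=5, legit=95, legit_days=20)}
clusters = {"198.51.100.0/24": History(spam=90, legit=10)}
cluster_of = {"192.0.2.1": "192.0.2.0/24", "198.51.100.7": "198.51.100.0/24"}

print(reputation("192.0.2.1", ips, cluster_of, clusters))
print(reputation("198.51.100.7", ips, cluster_of, clusters))
print(reputation("203.0.113.9", ips, cluster_of, clusters))
```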

This reputation function is extremely simple, but it 
still illustrates the value of using a history-based rep- 
utation mechanism to tackle the mail server overload 
problem. We also note that the historical IP reputa- 
tions based on network-aware clusters in this manner 
may not always be perfect predictors of spamming be- 
haviour. While network-aware clusters are an aggrega- 
tion technique with a basis in network structure, they 
could serve as a starting point for more complex clus- 
tering techniques, and these techniques may also incor- 
porate finer notions of granularity and confidence. 

A more sophisticated approach to using the history of 
IP addresses and network-aware clusters that addresses 
these concerns is likely to yield an improvement in per- 
formance, but is beyond the scope of this paper and left 
as future work. In the following section, we describe the 
performance benefits that we gain from using this repu- 
tation function in the evaluation. 


4.3 Evaluation 


We evaluate our history-based policies by replaying the 
traces of our data set on our simulator. Since the traces 
record each connection request with a time-stamp, we 
can replay the traces to simulate the exact workload re- 
ceived by the mail server. We do so, with the simplifying 
assumption that each incoming e-mail incurs the same
computational cost. Since our traces are fixed, we sim-
ulate overload by decreasing the simulated server’s ca-
pacity, and replaying the same traces. This way, we do
not change the distribution and connection request times
of IP addresses in the input traces between the different
experiments. At the same time, it allows us to simulate,
without changing the traces, how the mail server behaves
as a function of the increasing workload.

⁴Technically, this is slightly more complex: it examines if the load
is below 75% of the server capacity allowed to Phase 1.

⁵One technical detail left to consider is the IP addresses originat-
ing from clusters without history. In our reputation function, any IP
address that has no history-based reputation value is given a slightly
bad reputation.

Simulation Parameters: We now explain the parame- 
ters that we choose for our simulation. We choose the 
time t for the Phase 1 operation to be 4s.⁶ We use 60s
for the timeout m, the waiting time in the queue before
Phase 2 (it implies that mail will be delivered within 1
minute, or discarded after Phase 1). This appears to be
sufficiently small so as to not noticeably affect the deliv-
ery of legitimate mail.⁷

To induce overload, we vary the capacity of the sim- 
ulated mail server to 200, 100, 66, 50, and 40 mes- 
sages/minute. The greedy policy processed an average 
of 95.2% of the messages received when the server ca- 
pacity was set to 200 messages/minute, as seen in Ta- 
ble 2. At capacities larger than 200 messages/minute, 
the number of messages processed by the greedy policy 
grows very slowly, indicating that this is likely to be an 
effect of the distribution of connection requests in the 
traces. For this reason, we take capacity of 200/minute 
as the required server capacity. We then refer to the other 
server capacities in relation to required server capacity 
for this trace workload: a server with capacity of 100 
messages/minute must process the same workload with 
half the capacity of the required server, so we define it 
to have an overload-factor of 2. Likewise, the server ca- 
pacities we test 200, 100, 66, 50 and 40 messages/minute 
have overload-factors of around 1, 2, 3, 4, and 5 respec- 
tively. 
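
The overload-factors follow directly from the ratio of the required capacity to the simulated capacity, as this short check shows (the variable names are illustrative):

```python
REQUIRED_CAPACITY = 200  # messages/minute, taken as the required server capacity

capacities = [200, 100, 66, 50, 40]  # simulated capacities, messages/minute
overload_factors = [REQUIRED_CAPACITY / c for c in capacities]

for capacity, factor in zip(capacities, overload_factors):
    print(f"{capacity:>3} messages/minute -> overload-factor {factor:.2f}")
```

A capacity of 66 messages/minute gives a factor of about 3.03, which is why the text describes the factors as only approximately 1 through 5.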

Recall that the parameter α is the fraction of a message's
processing cost that is incurred at Phase 1. We expect α to impact the per-
formance, so we test two values, α = 0.1 and α = 0.5, in the eval-
uation; recall that α is likely to be small, and so α = 0.5
is a conservative choice here. The value of α has no ef-
fect on the performance of the greedy policy. For this
reason, the discussion features only one greedy policy
for all values of α. For the history-based policies, α
sometimes has an effect on the performance, since these
policies allow for a decision to be taken at Phase 2. We
therefore refer to the history-based policies as 10-policy
and 50-policy, for α = 0.1 and 0.5 respectively.

⁶We vary t for Phase 1 between 2-4s: our traces have a recorded
time granularity of 1s, and the maximum seen in the traces before a
disconnect was 4s. This does not appear to impact the results pre-
sented here, since both kinds of policies receive the same value of t.
We present the results for t = 4s.

⁷This value also has no noticeable impact on our results when m >
20s, suggesting that most of the legitimate mail is processed quickly, or
not at all.


4.3.1 Impact on Legitimate mail 


We first compare the number of legitimate mails ac- 
cepted by the different policies over many time intervals, 
where each interval is an hour long. Since our goal is 
to maximize the amount of legitimate mail accepted, the 
primary metric we use is the goodput ratio: the ratio of 
legitimate mail accepted by the mail server to the total le- 
gitimate mail in the time interval. This is a natural metric 
to use, since it makes the different time intervals compa- 
rable, and so we can see if the policies are consistently 
better than the greedy policy, rather than being heavily 
weighted by the number of legitimate mails in a few time 
intervals. For the performance evaluation, we examine 
the average goodput ratio, the distribution of the goodput 
ratios and the goodput improvement factor. 
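
A small sketch of the goodput ratio shows why averaging per-interval ratios differs from pooling all mail together. The interval counts are made up for illustration, and the function names are not from the paper.

```python
def goodput_ratio(accepted_legit, total_legit):
    """Fraction of an interval's legitimate mail that the server accepted."""
    return accepted_legit / total_legit


def average_goodput_ratio(intervals):
    """Mean of per-interval ratios; intervals without legitimate mail are skipped."""
    ratios = [goodput_ratio(a, t) for a, t in intervals if t > 0]
    return sum(ratios) / len(ratios)


# (accepted legitimate mails, total legitimate mails) for two one-hour intervals
intervals = [(90, 100), (1, 10)]

average = average_goodput_ratio(intervals)
pooled = sum(a for a, _ in intervals) / sum(t for _, t in intervals)
print(average, pooled)
```

The pooled figure is dominated by the busy interval, whereas the average of ratios weights both intervals equally, which is the property the metric is chosen for.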

Average Goodput Ratio: Table 1 shows the average 
goodput ratios for the different policies under different 
levels of overload. It shows that, on average, for each of 
these overloads, the goodput of each history-based policy is bet-
ter than that of the greedy policy. The difference is marginal at
overload-factor 1, and increases quickly as the overload-
factor increases: at overload-factor 4, the average good-
put ratio is 64.3-64.5% for any of the history-based poli-
cies, in comparison to 26.8% for the greedy policy. We
also observe that the history-based policies scale more 
gracefully with the overload. Thus, we conclude that, 
on average, the history-based policies gain a significant 
improvement over the greedy policy. 

Distribution of Goodput Ratios: While the average 
goodput ratio is a useful summarization tool, it does not 
give a complete picture of the performance. For this 
reason, we next compare the distribution of the server 
goodput in the different time intervals. Fig. 8(a)-(b) 
shows the CDF of the goodput ratios for the different 
policies, for two overload-factors: 1 and 4. We ob- 
serve that the goodput ratio distributions are quite sim- 
ilar for the greedy and history-based policies when the 
overload-factor is 1 (Fig. 8(a)): over 40% of the time,
all of the policies accept 100% of the messages. This changes
drastically as the overload-factor increases. Fig. 8(b)
shows the goodput ratio distributions for overload-factor
4. As much as 50% of the time, the greedy policy has a
goodput-ratio of at most 0.2. By contrast, more than 90%
of the time, the history-based policies have a goodput ra-
tio of at least 0.5. The results show that the history-
based policies have a consistent and significant improve-
ment over the greedy policy when the load is sufficiently
high.

Improvement factor of Goodput-Ratios: Finally, we 
compare the goodput ratios on a per-interval basis. For 
this analysis, we focus on the 10-policy; our goal is to 
see how often the 10-policy does better than the greedy
algorithm. That is, for each time interval, we compute
the goodput-factor, defined to be

$$\text{goodput-factor} = \frac{\text{Goodput of 10-policy}}{\text{Goodput of Greedy}}.$$

Fig. 8(c) plots how often the goodput-factor lies between
90% and 300% for the different overload-factors. We note
that when the overload-factor is 1, the performance im-
pact of our history-based policy on the legitimate mail
is marginal: in all the time intervals, the 10-policy has
a goodput-factor of at least 90%, and 99% of the time, it has a
goodput-factor of at least 99%. As the overload-factor increases,
the number of time intervals in which the 10-policy has a
goodput-factor of 100% or more increases, meaning the
number of time intervals in which the 10-policy does bet-
ter than the greedy algorithm increases, as we would ex-
pect. When the overload-factor is 4, for example, 66% of
the time, the goodput-factor is at least 200%: the 10-policy accepts
at least twice as much legitimate mail. We conclude that
in most time intervals, the history-based policies perform
better than the greedy policy, and the factor of their im-
provement increases as the overload-factor increases.

Lastly, we note that the behaviour of the 10-policy
and the 50-policy does not appear to differ too much
when the overload-factor is sufficiently high or suffi-
ciently low. With intermediate overload-factors, they
perform slightly differently, as we see in Table 1: the
50-policy tends to be a little more conservative about ac-
cepting messages that may not have a good reputation in
comparison to the 10-policy.


4.3.2. Impact on Throughput and Spam 


While our primary metric of performance is the goodput, 
we are still interested in the impact of using the history- 
based policies on the total messages and spam processed 
by the mail server. While these are not our primary goals, 
they are still important since they give a picture of the 
complete effect of using these history-based policies. 

Impact on Server Throughput: The history-based poli-
cies gain their improvement by selectively
choosing the IP addresses to process: they ac-
cept only good IP addresses in the incoming workload
if it is likely that the whole workload cannot be pro-
cessed. This may result in a decrease in server through-
put in comparison to the greedy policy for certain loads.
For example, if the server receives a little less workload 
than it could process, the history-based policies may pro- 
cess fewer messages than the greedy policy, because they 
may reserve capacity for good IP addresses that they ex- 
pect to see but which never actually appear. We observe 
this in our simulations and we discuss it now. 

We define throughput to be the fraction of the total mes-
sages that are processed by the server. Table 2 shows the av-
erage throughput achieved by both policies under vari-
ous capacities of the server. At overload-factor 1, when
the greedy algorithm achieves an average throughput of
95%, the history-based policies achieve an av-
erage throughput of 93%. However, even at this point,
the history-based policies accept a little more legitimate
mail (on average) than the greedy policy. Note that by de-
sign, the history-based policies guarantee that when the
server receives no more than 75% of its maximum load
capacity, its performance is no different from normal.

Figure 8: (a) and (b): CDF of the goodput-ratios for two
different overload-factors. (c) shows performance im-
provement (goodput-factor) for the 10-policy for various
overload factors.

Table 1: Server Goodput (average, in %).
Table 2: Server throughput (average, in %).
Table 3: Spam accepted (average, in %).

Impact on Spam: We also explored the effect of
the history-based policies on the number of spam mes-
sages accepted. Table 3 shows the average fraction
of spam messages accepted by the policies under var-
ious overload factors. We see that with an overload-factor
of 1, the history-based policies accept only 0.3-1%
less spam than the greedy algorithm. As the overload-
factor increases and the history-based policies grow more
and more conservative in accepting suspected spam, the
amount of spam accepted decreases. For example, at
an overload-factor of 2, this drops to 50.2%-65.5% for
the history-based policies. When the overload-factor in-
creases to 4, the history-based policies accept less than
half of the amount of spam accepted by the greedy pol-
icy. This suggests that if the server receives much more
workload than it can process, the spam is affected much
more than the legitimate mail. Therefore, the spammer
would not have an incentive to increase the workload sig-
nificantly, since it is the spam that gets most affected.

Thus, we have shown that our history-based policies
achieve a significant and consistent performance im-
provement over the greedy policy when the server is
under overload: we have seen this with multiple met-
rics of the goodput ratio. We have also seen that the
history-based policies do not impact the performance of
the server too much when the server is not under over-
load. Finally, we have seen that the spam is indeed
affected when the server is significantly overloaded; this
is precisely the behaviour we want to induce.


5 Related Work 


Since spam is so pervasive, much effort has been ex- 
pended in developing techniques that mitigate spam, and 
studies that understand various characteristics of spam- 
mers. In this section, we briefly survey some of the 
most related work. We first describe spam mitigation ap- 
proaches and how they may relate to our work on the 
server overload problem. Then we discuss measurement 
studies that are related and complementary to our mea- 
surement work. 

Traditionally, the two primary approaches to spam 
mitigation have used content-based spam-filtering and 
DNS blacklists. Content-based spam-filtering soft- 
ware [3, 1] is typically applied at the end of the mail pro- 
cessing queue, and there has been a lot of research [20, 
17, 7, 16] in techniques for content-based analysis and 
understanding its limits. Agarwal et al. [6] propose 
content-based analysis to rate-limit spam at the router; 
this also reduces the load on the mail server, but is not 
useful for our situation as it may be too computationally 
expensive. 

DNS blacklists [4, 5] are another popular way to
reduce spam. Studies on DNS blacklists [14] have shown
that over 90% of the spamming IP addresses were present 
in at least one blacklist at their time of appearance. Our 
approach is complementary to traditional blacklisting, 
and the more recent greylisting [13] techniques — we aim 
to prioritize the legitimate mail, and use the history of IP 
addresses to identify potential spammers. 

Perhaps the closest in spirit to our work in mitigating
server overload are the works of Twining et al. [23] and Tang
et al. [21]. Twining et al. describe a prioritization mech- 
anism that delays spam more than it delays legitimate 
mail. However, their problem is different, as they eventu- 
ally accept all email, but just delay the spam. Such an ap- 
proach would not work when all the mail simply cannot 
be accepted. While Tang et al. [21] do not consider the 
problem of server overload, they describe a mechanism 
to assign trust to and classify IP addresses using SVMs. 
Our work differs in the way it gets the historical reputa- 
tions — rather than using a blackbox learning algorithm, 
it uses the IP addresses and network-aware clusters, thus 
directly utilizing the structure of the network. 

There has also been interest in using reputation mech- 
anisms for identifying spam. There are a few commer- 
cial IP-based reputation systems (e.g., SenderBase [2], 
TrustedSource [22]). A general reputation system for 
internet defense has been proposed in [9]. There has 
been work on using social network information for 
designing reputation-granting mechanisms to mitigate
spam [10, 11, 8]. Prakash et al. [18] propose community- 
based filters trained with classifiers to identify spam. Our 
work differs from these reputation systems as it demon- 
strates the potential of using network-aware clusters to 
assign reputations to IP addresses for prioritizing legiti- 
mate mail. 

Recently, there have been studies on characterizing 
spammers, legitimate senders and mail traffic, and we 
only discuss the most closely related work here. Ra- 
machandran and Feamster [19] present a detailed anal- 
ysis of the network-level characteristics of spammers. 
By contrast, our work focuses on the comparison be- 
tween legitimate mail and spam and explores the stabil- 
ity of legitimate mail. We also use network-aware clus- 
ters to probabilistically distinguish the bulk of the legit- 
imate mail from the spam. Gomes et al. [12] study the 
e-mail arrivals, size distributions and temporal locality 
that distinguish spam traffic from non-spam traffic; these 
are interesting features that distinguish spam and legiti- 
mate traffic patterns and provide general insights into be- 
haviour. Our measurement study differs as it focuses on 
understanding the historical behaviour of mail servers at 
the network level that can be exploited for practical spam
mitigation.


6 Conclusion 


In this paper, we have focused on using IP addresses as a 
computationally-efficient tool for spam mitigation in sit- 
uations when the distinction need not be perfectly accu- 
rate. We performed an extensive analysis of IP addresses 
and network-aware clusters to identify properties that can 
distinguish the bulk of the legitimate mail and spam. Our 
analysis of IP addresses indicated that the bulk of the le- 
gitimate mail comes from long-lived IP addresses, while 
the analysis of network-aware clusters indicated that the 
bulk of the spam comes from clusters that are relatively 
long-lived. With these insights, we proposed and simu- 
lated a history-based reputation mechanism for prioritiz- 
ing legitimate mail when the mail server is overloaded. 
Our simulations show that the history and the structure 
of the IP addresses can be used to substantially reduce 
the adverse impact of mail server overload on legitimate 
mail, by up to a factor of 3. 

A Appendix


We present here the details of the policy used in Phase 1
for the history-based policies. In detail, the policy is the
following: If the load is less than 75% of its capacity, the
policy accepts all SMTP connection requests, regardless
of the reputation of the IP address. If the load is greater
than 75% of the capacity, the policy starts considering
the reputation of the IP address and the legitimate mail
that it expects to have to process in the near future.

For this purpose, it uses a distribution of the number of
emails expected in the next t time units from reputation
value at most k (for multiple k values), that is calculated
based on the history of the distribution of mail arrival.
Since our reputation function is the lifetime spam-ratio,
a low reputation value is a good reputation, and a high
reputation value is a bad reputation. Then it does the fol-
lowing: (a) given the current load, it computes the larg-
est k' such that all expected mail with reputation values
at most k' can be processed on the server; (b) it looks up
the reputation k of the IP address, and checks if it is higher
than k'. (If the IP address does not have a known rep-
utation value, and it does not belong to a cluster with a
known reputation, then the IP address is assigned a rel-
atively high reputation value.) If k ≤ k', then the connection
request of the IP address is accepted; otherwise, it is rejected.
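
The Phase-1 policy can be sketched as follows. This is one reading of the description, not the authors' implementation: it assumes that the threshold is the largest reputation value whose expected mail still fits in the spare capacity, that reputation values are spam-ratios (lower is better), and that the expected-arrival table is cumulative and sorted by reputation value. The names `admission_threshold`, `accept_connection` and `expected_by_reputation` are hypothetical.

```python
def admission_threshold(expected_by_reputation, spare_capacity):
    """Largest reputation value whose expected mail fits in the spare capacity.

    expected_by_reputation: list of (k, expected mails with reputation <= k),
    sorted by k ascending, so the expected counts are cumulative.
    Returns None when not even the best-reputation mail fits.
    """
    threshold = None
    for k, expected in expected_by_reputation:
        if expected > spare_capacity:
            break
        threshold = k
    return threshold


def accept_connection(load, capacity, ip_reputation, expected_by_reputation,
                      low_water=0.75):
    """Decide whether to accept an SMTP connection request in Phase 1."""
    # Below 75% load the server behaves normally and accepts everything.
    if load < low_water * capacity:
        return True
    threshold = admission_threshold(expected_by_reputation, capacity - load)
    # Accept only senders whose spam-ratio does not exceed the threshold.
    return threshold is not None and ip_reputation <= threshold


expected = [(0.1, 5), (0.5, 12), (0.9, 40)]
print(accept_connection(50, 100, 0.8, expected))  # light load
print(accept_connection(85, 100, 0.3, expected))  # heavy load, good sender
print(accept_connection(85, 100, 0.8, expected))  # heavy load, poor sender
print(accept_connection(98, 100, 0.05, expected)) # almost no spare capacity
```

As the load rises, the spare capacity shrinks and the threshold moves toward lower spam-ratios, so only senders with better and better reputations are admitted.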

Abstract


We present a new kind of network perimeter monitoring 
strategy, which focuses on recognizing the infection and 
coordination dialog that occurs during a successful mal- 
ware infection. BotHunter is an application designed to 
track the two-way communication flows between inter- 
nal assets and external entities, developing an evidence 
trail of data exchanges that match a state-based infec- 
tion sequence model. BotHunter consists of a correla- 
tion engine that is driven by three malware-focused net- 
work packet sensors, each charged with detecting spe- 
cific stages of the malware infection process, includ- 
ing inbound scanning, exploit usage, egg downloading, 
outbound bot coordination dialog, and outbound attack 
propagation. The BotHunter correlator then ties together 
the dialog trail of inbound intrusion alarms with those 
outbound communication patterns that are highly indica- 
tive of successful local host infection. When a sequence 
of evidence is found to match BotHunter’s infection di- 
alog model, a consolidated report is produced to capture 
all the relevant events and event sources that played a role 
during the infection process. We refer to this analytical 
strategy of matching the dialog flows between internal 
assets and the broader Internet as dialog-based correla- 
tion, and contrast this strategy to other intrusion detec- 
tion and alert correlation methods. We present our exper- 
imental results using BotHunter in both virtual and live 
testing environments, and discuss our Internet release of 
the BotHunter prototype. BotHunter is made available 
both for operational use and to help stimulate research in 
understanding the life cycle of malware infections. 


1 Introduction 


Over the last decade, malicious software or malware has 
risen to become a primary source of most of the scan- 
ning [38], (distributed) denial-of-service (DOS) activi- 
ties [28], and direct attacks [5], taking place across the 
Internet. Among the various forms of malicious soft- 
ware, botnets in particular have recently distinguished 
themselves to be among the premier threats to computing
assets [20]. Like the previous generations of computer 
viruses and worms, a bot is a self-propagating applica- 
tion that infects vulnerable hosts through direct exploita- 
tion or Trojan insertion. However, all bots distinguish 
themselves from the other malware forms by their abil- 
ity to establish a command and control (C&C) channel 
through which they can be updated and directed. Once 
collectively under the control of a C&C server, bots form 
what is referred to as a botnet. Botnets are effectively 
a collection of slave computing and data assets to be 
sold or traded for a variety of illicit activities, including 
information and computing resource theft, SPAM pro- 
duction, hosting phishing attacks, or for mounting dis- 
tributed denial-of-service (DDoS) attacks [12, 34, 20]. 


Network-based intrusion detection systems (IDSs) and 
intrusion prevention systems (IPSs) may come to mind as 
the most appealing technology for detecting and mitigat- 
ing botnet threats. Traditional IDSs, whether signature 
based [30, 35] or anomaly based [46, 8], typically focus 
on inbound packet flows for signs of malicious point-to-point
intrusion attempts. Network IDSs have the capacity
to detect initial incoming intrusion attempts, and the pro- 
lific frequency with which they produce such alarms in 
operational networks is well documented [36]. However, 
distinguishing a successful local host infection from the 
daily myriad of scans and intrusion attempts is as critical 
and challenging a task as any facet of network defense. 

Intrusion report correlation enables an analyst to ob- 
tain higher-level interpretations of network sensor alert 
streams, thereby alleviating noise-level issues with tradi- 
tional network IDSs. Indeed, there is significant research 
in the area of consolidating network security alarms into 
coherent incident pictures. One major vein of research 
in intrusion report correlation is that of alert fusion, i.e., 
clustering similar events under a single label [42]. The 
primary goal of fusion is log reduction, and in most sys- 
tems similarity is based upon either attributing multiple 
events to a single threat agent or providing a consolidated
view of a common set of events that target a single
victim. The bot infection problem satisfies neither cri- 
terion. The bot infection process spans several diverse 
transactions that occur in multiple directions and poten- 
tially involves several active participants. A more appli- 
cable area of alert correlation research is multistage at- 
tack recognition, in which predefined scenario templates 
capture multiple state transition sequences that may be 
initiated by multiple threat agents [40, 29]. In Section 3 
we discuss why predefined state transition models sim- 
ply do not work well in bot infection monitoring. While 
we argue that bot infections do regularly follow a series 
of specific steps, we find it rare to accurately detect all 
steps, and find it equally difficult to predict the order and 
time-window in which these events are recorded. 


Our Approach: We introduce an “evidence-trail” ap- 
proach to recognizing successful bot infections through 
the communication sequences that occur during the in- 
fection process. We refer to this approach as the infec- 
tion dialog correlation strategy. In dialog correlation, bot 
infections are modeled as a set of loosely ordered com- 
munication flows that are exchanged between an inter- 
nal host and one or more external entities. Specifically, 
we model all bots as sharing a common set of under- 
lying actions that occur during the infection life cycle: 
target scanning, infection exploit, binary egg download 
and execution, command and control channel establish- 
ment, and outbound scanning. We neither assume that 
all these events are required by all bots nor that every 
event will be detected by our sensor alert stream. Rather, 
our dialog correlation system collects an evidence trail of 
relevant infection events per internal host, looking for a 
threshold combination of sequences that will satisfy our 
requirements for bot declaration. 


Our System: To demonstrate our methodology, we 
introduce a passive network monitoring system called
BotHunter, which embodies our infection dialog correlation
strategy. The BotHunter correlator is driven
by Snort [35] with a customized malware-focused rule- 
set, which we further augment with two additional bot- 
specific anomaly-detection plug-ins for malware analy- 
sis: SLADE and SCADE. SLADE implements a lossy 
n-gram payload analysis of incoming traffic flows, tar- 
geting byte-distribution divergences in selected proto- 
cols that are indicative of common malware intrusions. 
SCADE performs several parallel and complementary 
malware-focused port scan analyses to both incoming 
and outgoing network traffic. The BotHunter correlator 
associates inbound scan and intrusion alarms with out- 
bound communication patterns that are highly indicative 
of successful local host infection. When a sufficient se- 
quence of alerts is found to match BotHunter’s infec- 
tion dialog model, a consolidated report is produced to 
capture all the relevant events and event participants that
contributed to the infection dialog.

Contributions: Our primary contribution in this pa- 
per is to introduce a new network perimeter monitor- 
ing strategy, which focuses on detecting malware infec- 
tions (specifically bots/botnets) through IDS-driven dia- 
log correlation. We present an abstraction of the major 
network packet dialog sequences that occur during a suc- 
cessful bot infection, which we call our bot infection di- 
alog model. Based on this model we introduce three bot- 
specific sensors, and our IDS-independent dialog corre- 
lation engine. Ours is the first real-time analysis system 
that can automatically derive a profile of the entire bot 
detection process, including the identification of the vic- 
tim, the infection agent, the source of the egg download, 
and the command and control center.¹ We also present
our analysis of BotHunter against more than 2,000 re- 
cent bot infection experiences, which we compiled by 
deploying BotHunter both within a high-interaction hon- 
eynet and through a VMware experimentation platform 
using recently captured bots. We validate our infection 
sequence model by demonstrating how our correlation 
engine successfully maps the network traces of a wide 
variety of recent bot infections into our model. 

The remainder of this paper is outlined as follows. 
In Section 2 we discuss the sequences of communica- 
tion exchanges that occur during a successful bot and 
worm infection. Section 3 presents our bot infection di- 
alog model, and defines the conditions that compose our 
detection requirements. Section 4 presents the BotH- 
unter architecture, and Section 5 presents our experi- 
ments performed to assess BotHunter’s detection perfor- 
mance. Section 6 discusses limitations and future work, 
and Section 7 presents related work. Section 8 discusses 
our Internet release of the BotHunter system, and in Sec- 
tion 9 we summarize our results. 


2 Understanding Bot Infection Sequences 


Understanding the full complexity of the bot infection 
life cycle is an important challenge for future network 
perimeter defenses. From the vantage point of the net- 
work egress position, distinguishing successful bot in- 
fections from the continual stream of background ex- 
ploit attempts requires an analysis of the two-way dia- 
log flow that occurs between a network’s internal hosts 
and the Internet. On a well-administered network, the 
threat of a direct-connect exploit is limited by the extent 
to which gateway filtering is enabled. However, contem- 
porary malware families are highly versatile in their abil- 
ity to attack susceptible hosts through email attachments, 
infected P2P media, and drive-by-download infections. 


¹Our current system implements a classic bot infection dialog
model. One can define new models in an XML configuration file and 
add new detection sensors. Our correlator is IDS-independent, flexible, 
and extensible to process new models without modification. 
Furthermore, with the ubiquity of mobile laptops and
virtual private networks (VPNs), direct infection of an 
internal asset need not necessarily take place across an 
administered perimeter router. Regardless of how mal- 
ware enters a host, once established inside the network 
perimeter the challenge remains to identify the infected 
machine and remove it as quickly as possible. 

For this present study, we focus on a rather narrow 
aspect of bot behavior. Our objective is to understand 
the sequence of network communications and data ex- 
changes that occur between a victim host and other net- 
work entities. To illustrate the stages of a bot infection, 
we outline an infection trace from one example bot, a 
variant of the Phatbot (aka Gaobot) family [4]. Figure 1 
presents a summary of communication exchanges that 
were observed during a local host Phatbot infection. 

As with many common bots that propagate through 
remote exploit injection, Phatbot first (step 1) probes an 
address range in search of exploitable network services 
or responses from Trojan backdoors that may be used to 
enter and hijack the infected machine. If Phatbot receives 
a connection reply to one of the targeted ports on a host, 
it then launches an exploit or logs in to the host using a 
backdoor. In our experimental case, a Windows work- 
station replies to a 135-TCP (MS DCE/RPC) connection 
request, establishing a connection that leads to an imme- 
diate RPC buffer overflow (step 2). Once infected, the 
victim host is directed by an upload shell script to open 
a communication channel back to the attacker to down- 
load the full Phatbot binary (step 3). The bot inserts it- 
self into the system boot process, turns off security soft- 
ware, probes the local network for additional NetBIOS 
shares, and secures the host from other malware that may 
be loaded on the machine. The infected victim next dis- 
tinguishes itself as a bot by establishing a connection to a 
botnet C&C server, which in the case of Phatbot is estab- 
lished over an IRC channel (step 4). Finally, the newly 
infected bot establishes a listen port to accept new binary 
updates and begins scanning other external victims on 
behalf of the botnet (step 5). 


3 Modeling the Infection Dialog Process 


While Figure 1 presents an example of a specific bot, 
the events enumerated are highly representative of the 
life cycle phases that we encounter across the various 
bot families that we have analyzed. Our bot propagation 
model is primarily driven by an assessment of outward- 
bound communication flows that are indicative of behav- 
ior associated with botnet coordination. Where possible, 
we seek to associate such outbound communication pat- 
terns with observed inbound intrusion activity. However, 
this latter activity is not a requirement for bot declaration. 
Neither are incoming scan and exploit alarms sufficient 
to declare a successful malware infection, as we assume 


that a constant stream of scan and exploit signals will be 
observed from the egress monitor. 


We model an infection sequence as a composition 
of participants and a loosely ordered sequence of ex- 
changes: Infection I = < A,V,C,V’,E,D >, where A 
= Attacker, V = Victim, E = Egg Download Location, C 
= C&C Server, and V’ = the Victim’s next propagation 
target. D represents an infection dialog sequence 
composed of bidirectional flows that cross the egress 
boundary. Our infection dialog D is composed of a set 
of five potential dialog transactions (E1, E2, E3, E4,
E5), some subset of which may be observed during an 
instance of a local host infection: 


– E1: External to Internal Inbound Scan

– E2: External to Internal Inbound Exploit

– E3: Internal to External Binary Acquisition

– E4: Internal to External C&C Communication

– E5: Internal to External Outbound Infection Scanning


Figure 2 illustrates our bot infection dialog model 
used for assessing bidirectional flows across the network 
boundary. Our dialog model is similar to the model pre- 
sented by Rajab et al. in their analysis of 192 IRC bot 
instances [33]. However, the two models differ in ways 
that arise because of our specific perspective of egress 
boundary monitoring. For example, we incorporate early 
initial scanning, which is often a preceding observation 
that occurs usually in the form of IP sweeps that tar- 
get a relatively small set of selected vulnerable ports. 
We also exclude DNS C&C lookups, which Rajab et 
al. [33] include as a consistent precursor to C&C co- 
ordination, because DNS lookups are often locally han- 
dled or made through a designated DNS server via inter- 
nal packet exchanges that should not be assumed visible 
from the egress position. Further, we exclude local host 
modifications and internal network propagation because 
these are also events that are not assumed to be visible 
from the egress point. Finally, we include internal-to- 
external attack propagation, which Rajab et al. [33] ex- 
clude. While our model is currently targeted for passive 
network monitoring events, it will be straightforward to 
include localhost-based or DNS-server-based IDSs that 
can augment our dialog model. 

Figure 2 is not intended to provide a strict ordering 
of events, but rather to capture a typical infection dialog 
(exceptions to which we discuss below). In the idealized 
sequence of a direct-exploit bot infection dialog, the bot 
infection begins with an external-to-internal communi- 
cation flow that may encompass bot scanning (E1) or a 
direct inbound exploit (E2). When an internal host has 
been successfully compromised (we observe that many 
compromise attempts regularly end with process dumps 
or system freezes), the newly compromised host downloads
and instantiates a full malicious binary instance of
the bot (E3). Once the full binary instance of the bot is
retrieved and executed, our model accommodates two potential
dialog paths, which Rajab et al. [33] refer to as the
bot Type I versus Type II split. Under Type II bots, the
infected host proceeds to C&C server coordination (E4)
before attempting self-propagation. Under a Type I bot,
the infected host immediately moves to outbound scanning
and attack propagation (E5), representing a classic
worm infection.

Figure 1: Phatbot Dialog Summary

We assume that bot dialog sequence analysis must be 
robust to the absence of some dialog events, must al- 
low for multiple contributing candidates for each of the 
various dialog phases, and must not require strict se- 
quencing on the order in which outbound dialog is con- 
ducted. Furthermore, in practice we have observed that 
for Type II infections, time delays between the initial infection
events (E1 and E2) and subsequent outbound dialog
events (E3, E4, and E5) can be significant, on the
order of several hours. In addition, our model must be
robust to failed E1 and E2 detections, possibly due to insufficient
IDS fidelity or due to malware infections that
occur through avenues other than direct remote exploit. 

One approach to addressing the challenges of se- 
quence order and event omission is to use a weighted 
event threshold system that captures the minimum nec- 
essary and sufficient sparse sequences of events under 
which bot profile declarations can be triggered. For ex- 
ample, one can define a weighting and threshold scheme 
for the appearance of each event such that a minimum 
set of event combinations is required before bot detec- 
tion. In our case, we assert that bot infection declaration 
requires a minimum of 

Condition 1: Evidence of local host infection (E2), 
AND evidence of outward bot coordination or attack 
propagation (E3-E5); or 

Condition 2: At least two distinct signs of outward 
bot coordination or attack propagation (E3-E5). 
Figure 2: Bot Infection Dialog Model


In our description of the BotHunter correlation en- 
gine in Section 4, we discuss a weighted event threshold 
scheme that enforces the above minimum requirement 
for bot declaration. 
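
As an illustration only, the minimum declaration requirement stated in Conditions 1 and 2 can be sketched as a per-host check. The class and method names below are hypothetical, and the sketch assumes that "distinct signs" means distinct dialog classes among E3-E5; the weighted threshold scheme actually used by the correlator is the one discussed in Section 4 and is not reproduced here.

```python
from dataclasses import dataclass, field

# Outbound dialog classes from the model: egg download (E3),
# C&C communication (E4), and outbound infection scanning (E5).
OUTBOUND_CLASSES = frozenset({"E3", "E4", "E5"})


@dataclass
class HostEvidence:
    """Dialog warning classes seen so far for one internal host (hypothetical helper)."""
    seen: set = field(default_factory=set)

    def add(self, dialog_class: str) -> None:
        # Record that a warning of this class (e.g. "E2") was observed.
        self.seen.add(dialog_class)

    def meets_declaration_minimum(self) -> bool:
        outbound = self.seen & OUTBOUND_CLASSES
        # Condition 1: inbound exploit evidence plus any outbound evidence.
        condition_1 = "E2" in self.seen and len(outbound) >= 1
        # Condition 2: at least two distinct outbound classes (our assumption
        # about what "distinct signs" means).
        condition_2 = len(outbound) >= 2
        return condition_1 or condition_2
```

Note that inbound scan (E1) and exploit (E2) warnings alone never satisfy the check, which matches the statement that incoming alarms are not sufficient to declare an infection.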


4 BotHunter: System Design 


We now turn our attention to the design of a passive mon- 
itoring system capable of recognizing the bidirectional 
warning signs of local host infections, and correlating 
this evidence against our dialog infection model. Our 
system, referred to as BotHunter, is composed of a trio 
of IDS components that monitor in- and out-bound traf- 
fic flows, coupled with our dialog correlation engine that 
produces consolidated pictures of successful bot infec- 
tions. We envision BotHunter to be located at the bound- 
ary of a network, providing it a vantage point to observe 
the network communication flows that occur between the 
network’s internal hosts and the Internet. Figure 3 illus- 
trates the components within the BotHunter package. 

Our IDS detection capabilities are composed on top 
of the open source release of Snort [35]. We take full 
advantage of Snort’s signature engine, incorporating an 
extensive set of malware-specific signatures that we de- 
veloped internally or compiled from the highly active 
Snort community (e.g., [10] among other sources). The 
signature engine enables us to produce dialog warnings 
for inbound exploit usage, egg downloading, and C&C 
patterns, as discussed in Section 4.1.3. In addition, we 
have developed two custom plugins that complement the 
Snort signature engine’s ability to produce certain dialog 
warnings. Note that we refer to the various IDS alarms 
as dialog warnings because we do not intend the individ- 
ual alerts to be processed by administrators in search of 
bot or worm activity. Rather, we use the alerts produced 
by our sensors as input to drive a bot dialog correlation 
analysis, the results of which are intended to capture and 
report the actors and evidence trail of a complete bot in- 
fection sequence. 
Figure 3: BotHunter System Architecture


Our two custom BotHunter plugins are called SCADE 
and SLADE. SCADE, discussed in Section 4.1.1, pro- 
vides inbound and outbound scan detection warnings 
that are weighted for sensitivity toward malware-specific 
scanning patterns. SLADE, discussed in Section 4.1.2, 
conducts a byte-distribution payload anomaly detection 
of inbound packets, providing a complementary non- 
signature approach in inbound exploit detection. 


Our BotHunter correlator is charged with maintaining 
an assessment of all dialog exchanges, as seen through 
our sensor dialog warnings, between all local hosts com- 
municating with external entities across the Internet. The 
BotHunter correlator manages the state of all dialog 
warnings produced per local host in a data structure we 
refer to as the network dialog correlation matrix (Fig- 
ure 4). Evidence of local host infection is evaluated and 
expired from our correlator until a sufficient combination 
of dialog warnings (E1-E5) crosses a weighted threshold.
When the bot infection threshold is crossed for a
given host, we produce a bot infection profile (illustrated 
in Figure 7). 


Finally, our correlator also incorporates a module that 
allows users to report bot infection profiles to a remote 
repository for global collection and evaluation of bot ac- 
tivity. For this purpose, we utilize the Cyber-TA privacy- 
enabled alert delivery infrastructure [32]. Our delivery 
infrastructure first anonymizes all source-local addresses 
reported within the bot infection profile, and then de- 
livers the profile to our data repository through a TLS 
over TOR [15] (onion routing protocol) network connec- 
tion. These profiles will be made available to the research 
community, ideally to help in the large-scale assessment 
of bot dialog behavior, the sources and volume of vari- 
ous bot infections, and for surveying where C&C servers 
and exploit sources are located. 


4.1 A Multiple-Sensor Approach to Gathering In- 
fection Evidence 


4.1.1 SCADE: Statistical sCan Anomaly Detection 
Engine 


Recent measurement studies suggest that modern bots 
are packaged with around 15 exploit vectors on average 
[33] to improve opportunities for exploitation. Depend- 
ing on how the attack source scans its target, we are likely 
to encounter some failed connection attempts prior to a 
successful infection. 

To address this aspect of malware interaction,
we have designed SCADE, a Snort preprocessor plug-in
with two modules, one for inbound scan detection (E1
dialog warnings) and another for detecting outbound attack
propagations (E5 dialog warnings) once our local
system is infected. SCADE E1 alarms provide a potential
early bound on the start of an infection, should this
scan eventually lead to a successful infection. 

Inbound Scan Detection: SCADE is similar in prin- 
ciple to existing scan detection techniques like [35, 24]. 
However, SCADE has been specifically weighted toward 
the detection of scans involving the ports often used by 
malware. It is also less vulnerable to DoS attacks be- 
cause its memory trackers do not maintain per-source-IP 
state. Similar to the scan detection technique proposed 
in [48], SCADE tracks only scans that are specifically 
targeted to internal hosts, bounding its memory usage 
to the number of inside hosts. SCADE also bases its 
E1 scan detection on failed connection attempts, further
narrowing its processing. We define two types of ports:
HS (high-severity) ports representing highly vulnerable
and commonly exploited services (e.g., 80/HTTP,
135,1025/DCOM, 445/NetBIOS, 5000/UPNP, 3127/MyDoom)
and LS (low-severity) ports.² Currently, we define
26 TCP and 4 UDP HS ports and mark all others as LS
ports. We assign different weights to failed scan attempts
on the two types of ports. An E1 dialog warning for a
local host is produced based on an anomaly score that is
calculated as $s = w_1 F_{hs} + w_2 F_{ls}$, where $F_{hs}$ and $F_{ls}$
indicate numbers of cumulative failed attempts at high-severity
and low-severity ports, respectively.

²Based on data obtained by analyzing vulnerability reports, mal-
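
The inbound score can be sketched as follows. This is a minimal illustration, not the SCADE implementation: the weights, the alert threshold, the comparison against that threshold, and all names are assumptions, and the HS port set lists only the example ports named in the text rather than the full 26 TCP and 4 UDP ports. State is keyed by internal host, in line with the observation that memory is bounded by the number of inside hosts.

```python
from collections import defaultdict

# Example HS ports taken from the text; the full HS list is not given here.
HS_PORTS = {80, 135, 445, 1025, 3127, 5000}

W_HS = 2.0       # hypothetical weight w_1 for high-severity ports
W_LS = 1.0       # hypothetical weight w_2 for low-severity ports
THRESHOLD = 6.0  # hypothetical score at which an E1 warning is raised


class InboundScanTracker:
    """Tracks failed inbound connection attempts per internal host."""

    def __init__(self):
        # internal host -> [F_hs, F_ls] cumulative failed-attempt counters
        self.failed = defaultdict(lambda: [0, 0])

    def record_failed_attempt(self, internal_host: str, dst_port: int) -> None:
        counters = self.failed[internal_host]
        if dst_port in HS_PORTS:
            counters[0] += 1
        else:
            counters[1] += 1

    def score(self, internal_host: str) -> float:
        # s = w_1 * F_hs + w_2 * F_ls
        f_hs, f_ls = self.failed.get(internal_host, (0, 0))
        return W_HS * f_hs + W_LS * f_ls

    def e1_warning(self, internal_host: str) -> bool:
        return self.score(internal_host) >= THRESHOLD
```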

Outbound Scan Detection: SCADE’s outbound scan 
detection coverage for E5 dialog warnings is based on a 
voting scheme (AND, OR or MAJORITY) of three par- 
allel anomaly detection models that track all external out- 
bound connections per internal host: 

• Outbound scan rate ($s_1$): Detects local hosts that conduct
high-rate scans across large sets of external addresses.

• Outbound connection failure rate ($s_2$): Detects abnormally
high connection failure rates, with sensitivity to
HS port usage. We calculate the anomaly score
$s_2 = (w_1 F_{hs} + w_2 F_{ls})/C$, where $C$ is the total number of
scans from the host within a time window.

• Normalized entropy of scan target distribution ($s_3$):
Calculates a Zipf (power-law) distribution of outbound
address connection patterns [3]. A uniformly distributed
scan target pattern provides an indication of a potential
outbound scan. We use an anomaly scoring technique
based on normalized entropy to identify such candidates:
$s_3 = \frac{H}{\ln(m)}$, where the entropy of scan target distribution
is $H = -\sum_{i=1}^{m} p_i \ln(p_i)$, $m$ is the total number of scan
targets, and $p_i$ is the percentage of the scans at target $i$.

Each anomaly module issues a subalert when $s_i > t_i$,
where $t_i$ is a threshold. SCADE then uses a user-configurable
“voting scheme”, i.e., AND, OR, or MAJORITY,
to combine the alerts from the three modules.
For example, the AND rule dictates that SCADE issues
an alert when all three modules issue an alert.
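
The two scores with explicit formulas and the voting step can be sketched as below. This is a sketch under stated assumptions: the function names and default weights are hypothetical, $m$ is taken to be the number of distinct scan targets, and the scan-rate score $s_1$ is omitted because no formula is given for it.

```python
import math
from collections import Counter


def failure_rate_score(f_hs, f_ls, total_scans, w1=2.0, w2=1.0):
    """s_2 = (w_1*F_hs + w_2*F_ls) / C; the weights are hypothetical defaults."""
    if total_scans == 0:
        return 0.0
    return (w1 * f_hs + w2 * f_ls) / total_scans


def normalized_entropy(targets):
    """s_3 = H / ln(m) over the observed scan targets.

    m is taken as the number of distinct targets (an assumption) and
    p_i as the fraction of scans sent to target i.
    """
    counts = Counter(targets)
    m = len(counts)
    if m <= 1:
        return 0.0  # a single target carries no spread
    total = len(targets)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(m)


def vote(subalerts, scheme="MAJORITY"):
    """Combine per-module subalerts (booleans) with AND, OR, or MAJORITY."""
    if scheme == "AND":
        return all(subalerts)
    if scheme == "OR":
        return any(subalerts)
    if scheme == "MAJORITY":
        return sum(subalerts) > len(subalerts) / 2
    raise ValueError("unknown voting scheme: " + scheme)
```

A scan spread evenly over many addresses yields a normalized entropy near 1, while traffic concentrated on a few destinations yields a value closer to 0, which is why a high $s_3$ is treated as scan-like.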


4.1.2 SLADE: Statistical PayLoad Anomaly 
Detection Engine 


SLADE is an anomaly-based engine for payload ex- 
ploit detection. It examines the payload of every request 
packet sent to monitored services and outputs an alert if 
its lossy n-gram frequency deviates from an established 
normal profile. 

SLADE is similar to PAYL [46], which is a
payload-based 1-gram byte distribution anomaly detection
scheme. PAYL examines the 1-gram byte distribution
of the packet payload, i.e., it extracts 256 features
each representing the occurrence frequency of one
of the 256 possible byte values in the payload. A normal
profile for a service/port, e.g., HTTP, is constructed
by calculating the average and standard deviation of the
feature vector of the normal traffic to the port. PAYL
calculates deviation distance of a test payload from the
normal profile using a simplified Mahalanobis distance,
$d(x, y) = \sum_i (|x_i - y_i|)/(\sigma_i + \alpha)$, where $y_i$ is the
mean, $\sigma_i$ is the standard deviation, and $\alpha$ is a smoothing
factor. A payload is considered anomalous if this
distance exceeds a predetermined threshold. PAYL is effective
in detecting worm exploits with a reasonable false
positive rate as shown in [46, 47]. However, it could be
evaded by a polymorphic blending attack (PBA) [18]. As
discussed in [47, 18, 31], a generic n-gram version of
PAYL may help to improve accuracy and the hardness
of evasion. The n-grams extract n-byte sequence information
from the payload, which helps in constructing a
more precise model of the normal traffic compared to the
single-byte (i.e., 1-gram) frequency-based model. In this
case the feature space in use is not 256, but $256^n$-dimensional.
It is impractical to store and compute in a $256^n$-dimensional
space for high-order n-grams.

ware infection vectors and analysis reports of datasets collected at
Dshield.org and other honeynets.

SLADE makes the n-gram scheme practical by using a 
lossy structure while still maintaining approximately the 
same accuracy as the original full n-gram version. We 
use a fixed vector counter (with size v) to store a lossy 
n-gram distribution of the payload. When processing a 
payload, we sequentially scan n-gram substring str, ap- 
ply some universal hash function h(), and increment the 
counter at the vector space indexed by h(str) mod v. 
We then calculate the distribution of the hashed n-gram 
indices within this (much) smaller vector space v. We 
define F as the feature space of n-gram PAYL (with a total
of $256^n$ distinct features), and F’ as the feature space
of SLADE (with v features). 

This hash function provides a mapping from F to F’ 
that we utilize for space efficiency. We require only v 
(e.g., v = 2,000), whereas n-gram PAYL needs $256^n$
(e.g., even for a small n = 3, $256^3 = 2^{24} \approx 16$M). The
computational complexity in examining each payload is
still linear (O(L), where L is the length of payload), and
the complexity in calculating distance is O(v) instead
of $256^n$. Thus, the runtime performance of SLADE is
comparable to 1-gram PAYL. Also note that although
both use hashing techniques, SLADE is different from
Anagram [45], which uses a Bloom filter to store all n-gram
substrings from normal payloads. The hash function
in SLADE is for feature compression and reduction,
whereas the hash functions in Anagram serve to reduce
the false positives of string lookup in the Bloom filter. In
essence, Anagram is like a content matching scheme. It
builds a huge knowledge base of all known good n-gram
substrings using efficient storage and query optimizations
provided by Bloom filters, and examines a payload
to determine whether the number of its n-gram substrings 
not in the knowledge base exceeds a threshold. 
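
The lossy n-gram counting described above can be sketched as follows. This is an illustration under assumptions, not SLADE's code: `zlib.crc32` stands in for the universal hash function h() (it is not a universal hash family), the function names are hypothetical, and the sketch assumes that SLADE scores a payload against its profile with the same simplified Mahalanobis distance that PAYL uses, which the text implies but does not state.

```python
import math
import zlib


def lossy_ngram_histogram(payload: bytes, n: int = 4, v: int = 2048) -> list:
    """Hash every n-gram of the payload into a fixed vector of v counters.

    Returns relative frequencies, so the result has v features instead of 256**n.
    """
    total = len(payload) - n + 1  # l = L - n + 1 substrings
    if total <= 0:
        return [0.0] * v
    counts = [0] * v
    for i in range(total):
        # crc32 is a stand-in for the universal hash h(); index = h(str) mod v
        counts[zlib.crc32(payload[i:i + n]) % v] += 1
    return [c / total for c in counts]


def build_profile(payloads, n: int = 4, v: int = 2048):
    """Per-feature mean and (population) standard deviation over normal payloads."""
    hists = [lossy_ngram_histogram(p, n, v) for p in payloads]
    k = len(hists)
    columns = list(zip(*hists))
    mean = [sum(col) / k for col in columns]
    std = [math.sqrt(sum((x - m) ** 2 for x in col) / k)
           for col, m in zip(columns, mean)]
    return mean, std


def simplified_mahalanobis(x, mean, std, alpha: float = 0.001) -> float:
    """d(x, y) = sum_i |x_i - y_i| / (sigma_i + alpha); this pass costs O(v)."""
    return sum(abs(xi - yi) / (si + alpha) for xi, yi, si in zip(x, mean, std))
```

Building the histogram is a single linear pass over the payload, and the distance is a single pass over v counters, which is the O(L) and O(v) behavior stated above.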

A natural concern of using such a lossy data structure
is the issue of accuracy: how many errors (false positives
and false negatives) may be introduced because
of the lossy representation? To answer this question,
we perform the following simple analysis.³ Let us first
overview the reason why the original n-gram PAYL can
detect anomalies. We use $\gamma$ to represent the number of
non-zero value features in F for a normal profile used
by PAYL. Similarly, $\gamma'$ is the number of non-zero value
features in F’ for a normal profile used by SLADE. For
a normal payload of length L, there is a total of
$l = (L - n + 1)$ n-gram substrings. Among these $l$ substrings,
$1 - \beta_n$ percent substrings converge to $\gamma$ distinct
features in the normal profile, i.e., these substrings share
similar distributions as the normal profile. The remaining
(small portion) $\beta_n$ percent of substrings are considered
as noise substrings that do not belong to the $\gamma$ features
in the normal profile. For a malicious payload, if it can
be detected as an anomaly, it should have a much larger
portion of noise substrings $\beta_a$ ($\beta_a > \beta_n$).

We first analyze the false positives when using the 
lossy structure representation to see how likely SLADE 
will detect a normal (considered normal by n-gram 
PAYL) payload as anomalous. For a normal payload, the 
hashed indices of a $1 - \beta_n$ portion of substrings (that
converge to $\gamma$ distinct features in F for the normal profile
of PAYL) should now converge in the new vector
space (into $\gamma'$ distinct features in F’ for the normal profile
of SLADE). Because of the universal hash function,
hashed indices of the $\beta_n$ portion of noise substrings are
most likely uniformly distributed into F’. As a result,
some of the original noise substrings may actually be
hashed to the $\gamma'$ distinct features in the normal profile
of SLADE (i.e., they may not be noise in the new feature
space now). Thus, the deviation distance (i.e., the
anomaly score) can only decrease in SLADE. Hence, we
conclude that SLADE may not have a higher false positive
rate than n-gram PAYL.

Now let us analyze the false negative rate, i.e., the like- 
lihood that SLADE will treat a malicious payload (as 
would be detected by n-gram PAYL) as normal. False 
negatives happen when the hash collisions in the lossy 
structure mistakenly map a $\beta_a$ portion of noise substrings
into the $\gamma'$ features (i.e., the normal profile) for SLADE.
By using the universal hash function, the probability for
a noise substring to fall into $\gamma'$ out of v space is $\gamma'/v$. Thus,
the probability for all the $l\beta_a$ noise substrings to collide
into the $\gamma'$ portion is about $(\gamma'/v)^{l\beta_a}$. For example, if we
assume $v = 2{,}000$, $\gamma' = 200$, $l\beta_a = 100$, then this probability
is about $(200/2000)^{100} = 10^{-100} \approx 0$. In practice,
the probability of such collisions for partial noise
substrings is negligible. Thus, we believe that SLADE
does not incur a significant accuracy penalty compared to
full n-gram PAYL, while significantly reducing its storage
and computation complexity.

³We consider our analysis not as an exact mathematical proof, but
an analytical description about the intuition behind SLADE.
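
The collision estimate in the false-negative argument is easy to check numerically. The helper below is a hypothetical illustration that assumes, as the analysis does, that each noise substring hashes independently and uniformly into the v counters.

```python
import math


def all_noise_collide_probability(gamma_prime: int, v: int, noise_count: int) -> float:
    """Probability that every one of noise_count noise substrings lands in
    one of the gamma_prime 'normal' counters out of v, i.e. (gamma'/v)**(l*beta_a)."""
    return (gamma_prime / v) ** noise_count


# Values from the example in the text: v = 2,000, gamma' = 200, l*beta_a = 100.
p = all_noise_collide_probability(200, 2000, 100)
log10_p = math.log10(p)  # order of magnitude of the estimate
```

Because the base is $\gamma'/v$, the estimate shrinks geometrically with the number of noise substrings, so even a modest amount of noise makes a complete collision extremely unlikely under the uniformity assumption.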

We measured the performance of SLADE in compar- 
ison to 1-gram PAYL by using the same data set as in 
[31]. The training and test data sets used were from 
the first and following four days of HTTP requests from 
the Georgia Tech campus network, respectively. The at- 
tack data consists of 18 HTTP-based buffer overflow at- 
tacks, including 11 regular (nonpolymorphic) exploits, 
6 mimicry exploits generated by CLET, and 1 polymor- 
phic blending attack used in [18] to evade 2-gram PAYL. 
In our experiment, we set n = 4, v = 2,048.⁴

Table 1 summarizes our experimental results. Here, 
DFP is the desired false positive rate, i.e., the rejection 
rate in the training set. RFP is the “real” false positive 
rate in our test data set. The detection rate is measured 
on the attack data set and is defined as the number of at- 
tack packets classified as anomalous divided by the total 
number of packets in the attack instances. We conclude 
from the results that SLADE performs better with respect 
to both DFP and RFP than the original PAYL (1-gram) 
system. Furthermore, we discovered that the minimum 
RFP for which PAYL is able to detect all attacks, includ- 
ing the polymorphic blending attack, is 4.02%. This is 
usually considered intolerably high for network intrusion 
detection. On the other hand, the minimum RFP required 
for SLADE to detect all attacks is 0.3601%. As shown 
in [31], 2-gram PAYL does not detect the polymorphic 
blending attack even if we are willing to tolerate an RFP 
as high as 11.25%. This is not surprising given that the 
polymorphic blending attack we used was specifically 
tailored to evade 2-gram PAYL. We also find that SLADE 
is comparable to (or even better than) a well-constructed 
ensemble IDS that combines 11 one-class SVM classi- 
fiers [31], and detects all the attacks, including the poly- 
morphic blending attack, for an RFP at around 0.49%. 
SLADE also has the added advantage of more efficient 
resource utilization, which results in shorter training and 
execution times when compared to the ensemble IDS. 


4.1.3 Signature Engine: Bot-Specific Heuristics 


Our final sensor contributor is the Snort signature engine. 
This module plays a significant role in detecting several 
of the classes of dialog warnings from our bot infection 
dialog model. Snort is our second sensor source for di- 
rect exploit detection (class E2), and our primary source 
for binary downloading (E3) and C&C communications 
(E4). We organize the rules selected for BotHunter into 
four separate rule files, covering 1046 E2 rules, 71 E3 
rules, 246 E4 rules, and a small collection of 20 E5 rules, 
for a total of 1383 heuristics. The rules are primarily de-


⁴ One can also choose a random v to better defeat evasion attacks
such as the polymorphic blending attack (PBA). One may also use multiple
hash functions and vectors for potentially better accuracy and greater
resistance to evasion.

Table 1: Performance of 1-gram PAYL and SLADE

| System | Metric             | 0.0     | 0.01    | 0.1     | 1.0     | 2.0     | 5.0     | 10.0     |
|        | DFP (%)            | 0.0     | 0.01    | 0.1     | 1.0     | 2.0     | 5.0     | 10.0     |
| PAYL   | RFP (%)            | 0.00022 | 0.01451 | 0.15275 | 0.92694 | 1.86263 | 5.69681 | 11.05049 |
| PAYL   | Detected Attacks   | 1       | 4       | 17      | 17      | 17      | 18      | 18       |
| PAYL   | Detection Rate (%) | 0.8     | 17.5    | 69.1    | 72.2    | 72.2    | 73.8    | 78.6     |
| SLADE  | RFP (%)            | 0.0026  | 0.0189  | 0.2839  | 1.9987  | 3.3335  | 6.3064  | 11.0698  |
| SLADE  | Detected Attacks   | 3       | 13      | 17      | 18      | 18      | 18      | 18       |
| SLADE  | Detection Rate (%) | 20.6    | 74.6    | 92.9    | 99.2    | 99.2    | 99.2    | 99.2     |























rived from the Bleeding-Edge [10] and SourceFire’s reg- 
istered free rulesets. 

All the rulesets were selected specifically for their relevance
to malware identification. Our rule selections are
continually tested and reviewed across operational networks
and our live honeynet environment. It is typical
for our rule-based heuristics to produce fewer than 300
dialog warnings per 10-day period when monitoring the span
port of an operational border switch serving approximately 130
operational hosts (SRI Computer Science Laboratory).

Our E2 ruleset focuses on the full spectrum of external-to-internal
exploit injection attacks, and has been tested
and augmented with rules derived from experimentation
in our medium- and high-interaction honeynet environment,
where we can observe and validate live malware
infection attempts. Our E3 rules focus on (malware)
executable download events from external sites to internal
networks, covering as many indications of (malicious)
binary executable downloads and download acknowledgment
events as are in the publicly available
Snort rulesets. Our E4 rules cover internally initiated
bot command and control dialog and acknowledgment
exchanges, with a significant emphasis on IRC and URL-based
bot coordination.⁵ Also covered are commonly
used Trojan backdoor communications, and popular bot
commands built by keyword searching across common
major bot families and their variants. A small set of E5
rules is also incorporated to detect well-known internal-to-external
backdoor sweeps, while SCADE provides the
more in-depth hunt for general outbound port scanning.


4.2 Dialog-Based IDS Correlation Engine 


The BotHunter correlator tracks the sequences of IDS 
dialog warnings that occur between each local host 
and those external entities involved in these dialog ex- 
changes. Dialog warnings are tracked over a temporal 
window, where each contributes to an overall infection 
sequence score that is maintained per local host. We in- 
troduce a data structure called the network dialog corre- 
lation matrix, which is managed, pruned, and evaluated 
by our correlation engine at each dialog warning inser- 
tion point. Our correlator employs a weighted thresh- 
old scoring function that aggregates the weighted scores 


⁵ E4 rules are essentially protocol, behavior, and payload content
signatures, rather than a hard-coded list of known C&C domains.


of each dialog warning, declaring a local host infected 
when a minimum combination of dialog transactions oc- 
cur within our temporal pruning interval. 


Figure 4 illustrates the structure of our network dialog 
correlation matrix. Each dynamically-allocated row cor- 
responds to a summary of the ongoing dialog warnings 
that are raised between an individual local host and other 
external entities. The BotHunter correlator manages the 
five classes of dialog warnings presented in Section 3 (E1 
through E5), and each event cell corresponds to one or 
more (possibly aggregated) sensor alerts that map into 
one of these five dialog warning classes. This correlation 
matrix dynamically grows when new activity involving a 
local host is detected, and shrinks when the observation 
window reaches an interval expiration. 

In managing the dialog transaction history we employ 
an interval-based pruning algorithm to remove old di- 
alog from the matrix. In Figure 4, each dialog may 
have one or two expiration intervals, corresponding to 
a soft prune timer (the open-faced clocks) and a hard 
prune timer (the filled clocks). The hard prune inter- 
val represents a fixed temporal interval over which di- 
alog warnings are allowed to aggregate, and the end of 
which results in the calculation of our threshold score. 
The soft prune interval represents a smaller temporal 
window that allows users to configure tighter pruning 
interval requirements for high-production dialog warn- 
ings (inbound scan warnings are expired more quickly 
by the soft prune interval), while the others are allowed 
to accumulate through the hard prune interval. If a dia- 
log warning expires solely because of a soft prune timer, 
the dialog is summarily discarded for lack of sufficient 
evidence (an example is row 1 in Figure 4, where only E1
has alarms). However, if a dialog expires because of a
hard prune timer, the dialog threshold score is evaluated, 
leading either to a bot declaration or to the complete re- 
moval of the dialog trace should the threshold score be 
found insufficient. 
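
The two-timer pruning behavior can be sketched as follows. This is a minimal illustration and not the BotHunter implementation: the class and function names are hypothetical, and we assume that only E1 (inbound scan) warnings are governed by the soft prune timer, as in the example given in the text.

```python
from dataclasses import dataclass, field

# Assumption for illustration: only inbound scan warnings (E1) are soft-pruned.
SOFT_PRUNED_CLASSES = {"E1"}


@dataclass
class DialogRow:
    """One row of the correlation matrix: warnings seen for one local host."""
    host: str
    warnings: dict = field(default_factory=dict)  # class name -> list of timestamps

    def add(self, warning_class: str, timestamp: float) -> None:
        self.warnings.setdefault(warning_class, []).append(timestamp)


def on_soft_prune(row: DialogRow) -> str:
    """Soft timer fired: discard the row if it holds nothing but soft-pruned classes."""
    if set(row.warnings) <= SOFT_PRUNED_CLASSES:
        return "discard"
    return "keep"


def on_hard_prune(score: float, threshold: float) -> str:
    """Hard timer fired: the accumulated score decides between declaration and removal."""
    return "declare_bot" if score >= threshold else "discard"


scan_only = DialogRow("192.0.2.10")
scan_only.add("E1", 0.0)

mixed = DialogRow("192.0.2.11")
mixed.add("E1", 0.0)
mixed.add("E4", 35.0)

print(on_soft_prune(scan_only), on_soft_prune(mixed))
```

A row that holds only scan warnings is dropped at the soft timer, whereas a row with any other dialog class survives until the hard timer, at which point its score is evaluated exactly once.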

Figure 4: BotHunter Network Dialog Correlation Matrix

To declare that a local host is infected, BotHunter
must compute a sufficient and minimum threshold of evidence
(as defined in Section 3) within its pruning interval.
BotHunter employs two potential criteria required
for bot declaration: 1) an incoming infection warning
(E2) followed by outbound local host coordination or exploit
propagation warnings (E3-E5), or 2) at least two forms
of outbound bot dialog warnings (E3-E5). To translate
these requirements into a scoring algorithm, we employ a
regression model to estimate dialog warning weights and a
threshold value, and then test our values against a corpus
of malware infection traces. We define an expectation table
of predictor variables that match our conditions and apply a
regression model where the estimated regression coefficients
are the desired weights shown in Table 2. For completeness,
the computed expectation table is provided on the project
website [1].
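
Before any weighting is applied, the two declaration criteria can be read as a simple predicate over the time-ordered warnings for one host. The sketch below is an illustrative reading of those criteria, not the regression-based scoring that BotHunter actually uses. The function name and the event representation are assumptions, and both E2 sensors are folded into a single "E2" label.

```python
OUTBOUND = {"E3", "E4", "E5"}


def meets_declaration_criteria(events: list[tuple[float, str]]) -> bool:
    """events: (timestamp, dialog_class) pairs for one local host,
    all assumed to fall within the pruning interval."""
    ordered = sorted(events)

    # Criterion 2: at least two distinct forms of outbound dialog warnings.
    outbound_seen = {cls for _, cls in ordered if cls in OUTBOUND}
    if len(outbound_seen) >= 2:
        return True

    # Criterion 1: an inbound infection warning (E2) followed by an outbound warning.
    e2_seen = False
    for _, cls in ordered:
        if cls == "E2":
            e2_seen = True
        elif e2_seen and cls in OUTBOUND:
            return True
    return False


print(meets_declaration_criteria([(1.0, "E2"), (2.0, "E4")]))
print(meets_declaration_criteria([(1.0, "E1"), (2.0, "E2")]))
```

Note that ordering matters only for the first criterion: an outbound warning that precedes the E2 warning does not satisfy it.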























| Dialog class | Coefficient | Standard Error |
| E1           | 0.09375     | 0.100518632    |
| E2 rulebase  | 0.28125     | 0.075984943    |
| E2 slade     | 0.09375     | 0.075984943    |
| E3           | 0.34375     | 0.075984943    |
| E4           | 0.34375     | 0.075984943    |
| E5           | 0.34375     | 0.075984943    |














Table 2: Initial Weighting 


These coefficients provide an approximate weighting
system to match the initial expectation table.⁶ We apply
these values to our expectation table data to establish a
threshold between bot and no-bot declaration. Figure 5
illustrates our results, where bot patterns are at X-axis
value 1, and non-bot patterns are at X-axis value 0. Bot scores
are plotted vertically on the Y-axis. We observe that all
but one of the non-bot patterns score below 0.6, and all but two
of the bot patterns score above 0.65. Next, we examine our
scoring model against a corpus of BotHunter IDS warn- 
ing sets produced from successful bot and worm infec- 
tions captured in the SRI honeynet between March and 
April 2007. Figure 6 plots the actual bot scores produced 
from these real bot infection traces. All observations pro- 
duce BotHunter scores of 0.65 or greater. 
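
To make the weighting concrete, the sketch below sums the Table 2 coefficients over the dialog classes observed for a host. It assumes that each class contributes its weight once, regardless of how many warnings of that class were raised; this aggregation rule and the identifier names are our assumptions for illustration, not details confirmed by the text.

```python
# Weights copied from Table 2 (estimated regression coefficients).
WEIGHTS = {
    "E1": 0.09375,
    "E2_rulebase": 0.28125,
    "E2_slade": 0.09375,
    "E3": 0.34375,
    "E4": 0.34375,
    "E5": 0.34375,
}


def bot_score(observed: set[str]) -> float:
    """Sum the weight of every dialog class observed at least once (assumed rule)."""
    unknown = observed - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown dialog classes: {sorted(unknown)}")
    return sum(WEIGHTS[cls] for cls in observed)


# Scan evidence alone versus two forms of outbound dialog.
print(bot_score({"E1"}))
print(bot_score({"E3", "E4"}))
```

Under this reading, inbound scans and SLADE anomalies alone contribute little, while any two outbound classes already sum to 0.6875.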

When a dialog sequence is found to cross the thresh- 
old for bot declaration, BotHunter produces a bot pro- 
file. The bot profile represents a full analysis of roles 


⁶ In our model, we define E1 scans and the E2 anomaly score (produced
by SLADE) as increasers of infection confidence, such that our
model lowers their weight influence.


of the dialog participants, summarizes the dialog alarms
according to the dialog classes (E1-E5) to which they map,
and computes the infection time interval. Figure 7 (right)
provides an example of a bot profile produced by the 
BotHunter correlation engine. The bot profile begins 
with an overall dialog anomaly score, followed by the IP 
address of the infected target (the victim machine), infec- 
tor list, and possible C&C server. Then it outputs the di- 
alog observation time and reporting time. The raw alerts 
specific to this dialog are listed in an organized (E1-E5) 
way and provide some detailed information. 
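
The fields of a bot profile listed above can be pictured as a small record. The following container is purely hypothetical: the field names, the representation of the observation time as a start and end pair, and the placeholder alert strings are our assumptions and do not reflect BotHunter's actual output format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class BotProfile:
    """Hypothetical container mirroring the profile fields described in the text."""
    score: float                  # overall dialog anomaly score
    victim_ip: str                # infected local host
    infectors: list[str]          # external hosts that delivered the exploit
    cnc_server: Optional[str]     # possible C&C server, if one was observed
    observed_start: datetime      # start of the dialog observation time
    observed_end: datetime        # end of the dialog observation time
    reported_at: datetime         # time at which the profile was generated
    alerts: dict[str, list[str]] = field(default_factory=dict)  # raw alerts by class

    def alerts_in_order(self) -> list[tuple[str, str]]:
        """Return (class, alert) pairs grouped in E1..E5 order."""
        return [(cls, a) for cls in sorted(self.alerts) for a in self.alerts[cls]]


profile = BotProfile(
    score=1.5,
    victim_ip="192.0.2.10",       # documentation address, not real data
    infectors=["198.51.100.7"],
    cnc_server="203.0.113.5",
    observed_start=datetime(2007, 1, 1, 12, 0, 0),
    observed_end=datetime(2007, 1, 1, 12, 3, 0),
    reported_at=datetime(2007, 1, 1, 12, 4, 0),
    alerts={"E4": ["placeholder C&C alert"], "E2": ["placeholder exploit alert"]},
)
print([cls for cls, _ in profile.alerts_in_order()])
```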


5 Evaluating Detection Performance 


To evaluate BotHunter’s performance, we conducted sev- 
eral controlled experiments as well as real world deploy- 
ment evaluations. We begin this section with a discus- 
sion of our detection performance while exposing BotH- 
unter to infections from a wide variety of bot fami- 
lies using in situ virtual network experiments. We then 
discuss a larger set of true positive and false negative 
results while deploying BotHunter to a live VMWare- 
based high-interaction honeynet. This recent experi- 
ment exposed BotHunter to 2,019 instances of Windows 
XP and Windows 2000 direct-exploit malware infections 
from the Internet. We follow these controlled experi- 
ments with a brief discussion of an example detection 
experience using BotHunter during a live operational de- 
ployment. 

Next, we discuss our broader testing experiences in 
two network environments. Here, our focus is on un- 
derstanding BotHunter’s daily false positive (FP) perfor- 
mance, at least in the context of two significantly dif- 
ferent operational environments. A false positive in this 
context refers to the generation of a bot profile in re- 
sponse to a non-infection traffic flow, not to the number 
of IDS dialog warnings produced by the BotHunter sen- 
sors. As stated previously, network administrators are 
not expected to analyze individual IDS alarms. Indeed, 
we anticipate external entities to regularly probe and at- 
tack our networks, producing a regular flow of dialog 
warnings. Rather, we assert (and validate) that the dialog 
combinations necessary to cause a bot detection should
be rarely encountered in normal operations.

Figure 5: Scoring Plot from Expectation Table


5.1 Experiments in an In situ Virtual Network


Our evaluation setup uses a virtual network environment 
of three VMware guest systems. The first is a Linux 
machine with IRC server installed, which is used as the 
C&C server, and the other two are Windows 2000 in- 
stances. We infect one of the Windows instances and 
wait for it to connect to our C&C server. Upon connec- 
tion establishment, we instruct the bot to start scanning 
and infecting neighboring hosts. We then await the infection
of, and IRC C&C channel join by, the second Windows
instance. By monitoring the network activity of
the second victim, we capture the full infection dialog. 
This methodology provides a useful means to measure 
the false negative performance of BotHunter. 

We collected 10 different bot variants from three of the 
most well-known IRC-based bot families [20]: Agobot/ 
Gaobot/Phatbot, SDBot/RBot/UrBot/UrXBot, and the 
mIRC-based GTbot. We then ran BotHunter in this virtual
network and limited its correlation focus to the victim
machine (essentially, we assume the HOMENET is
the victim’s IP). BotHunter successfully detected all bot
infections (and produced bot profiles for all). 

We summarize our measurement results for this vir- 
tual network infection experiment in Table 3. We use 
Yes or No to indicate whether a certain dialog warning 
is reported in the final profile. The two numbers within
parentheses are the number of generated dialog warnings in
the whole virtual network and the number involving our 
victim, respectively. For example, for Phatbot-rls, 2,834 
dialog warnings are generated by E2[rb] ([rb] means 
Snort rule base, [sl] means SLADE), but only 46 are rel- 
evant to our bot infection victim. Observe that although 
many warnings are generated by the sensors, only one 
bot profile is generated for this infection. This shows that 
BotHunter can significantly reduce the amount of infor- 
mation a security administrator needs to analyze. In our 
experiments almost all sensors worked as we expected. 
We do not see E1 events for RBot because the RBot family
does not provide any command to trigger a vertical
scan for all infection vectors (such as the “scan.startall”
command provided by the Agobot/Phatbot family). The
bot master must indicate a specific infection vector and 
port for each scan. We set our initial infection vector to 
DCOM, and since this was successful the attacking host 
did not attempt further exploits. 


Note that two profiles are reported in the gt-with-dcom
case. In the first profile, only E2[rb], E2[sl], and E4 are
observed. In the second profile, E4 and E5 are observed (this
is the case where we miss the initial infection period).
The split occurs because this infection scenario is very slow and
lasts longer than our 4-minute correlation time window. Furthermore,
note that no E3 dialog warnings are reported for this
infection sequence. Regardless, BotHunter successfully
generates an infection profile. This demonstrates the utility
of BotHunter’s evidence-trail-based dialog correlation model.
We also reran this experiment with a 10-minute correlation time
window, with which BotHunter reported a single
infection profile.


5.2 SRI Honeynet Experiments 


Our experimental honeynet framework has three integral 
components. The first component, the Drone manager, is a
software management component that is responsible for 
keeping track of drone availability and forwarding pack- 
ets to various VMware instances. The address of one of 
the interfaces of this Intel Xeon 3 GHz dual core system 
is set to be the static route for the unused /17 network. 
The other interface is used for communicating with the 
high-interaction honeynet. Packet forwarding is accomplished
using network address translation. One important
requirement for this system is to keep track of infected
drone systems and to recycle uninfected systems.
Upon detecting a probable infection (outbound connec- 
tions), we mark the drone as “tainted” to avoid reas- 
signing that host to another source. Tainted drones are 
saved for manual analysis or automatically reverted back 
to previous clean snapshots after a fixed timeout. One 







Table 3: Dialog Summary of Virtual Network Infections 


















































| Bot variant                    | E1        | E2[rb]           | E2[sl]   | E3       | E4          | E5       |
| agobot3-priv4                  | Yes(2/2)  | Yes(9/8)         | Yes(6/6) | Yes(5)   | Yes(38/8)   | Yes(4/1) |
| phat-alpha5                    | Yes(14/4) | Yes(5,785/5,721) | Yes(6/2) | Yes(3/3) | Yes(28/26)  | Yes(4/2) |
| phatbot-rls                    | Yes(11/3) | Yes(2,834/46)    | Yes(6/2) | Yes(8/8) | Yes(69/20)  | Yes(6/2) |
| rbot0.6.6                      | No(0)     | Yes(2/1)         | Yes(2/1) | Yes(2/2) | Yes(65/24)  | Yes(2/1) |
| rxbot7.5                       | No(0)     | Yes(2/2)         | Yes(2/2) | Yes(2/2) | Yes(70/27)  | Yes(2/1) |
| rx-asn-2-re-workedv2           | No(0)     | Yes(4/3)         | Yes(3/2) | Yes(2/2) | Yes(59/18)  | Yes(2/1) |
| Rxbot-ak-0.7-Modded.by.Uncanny | No(0)     | Yes(3/2)         | Yes(3/2) | Yes(2/2) | Yes(73/26)  | Yes(2/1) |
| sxtbot6.5                      | No(0)     | Yes(3/2)         | Yes(3/2) | Yes(2/2) | Yes(65/24)  | Yes(2/1) |
| Urx-Special-Ed-UltrA-2005      | No(0)     | Yes(3/2)         | Yes(3/2) | Yes(2/2) | Yes(68/22)  | Yes(2/1) |
| gt-with-dcom-profile1          | No(1/0)   | Yes(5/3)         | Yes(6/2) | No(0)    | Yes(221/1)  | No(4/0)  |
| gt-with-dcom-profile2          | No(1/0)   | No(5/0)          | No(6/0)  | No(0)    | Yes(221/44) | Yes(4/2) |
| gt-with-dcom-10min-profile     | No(1/0)   | Yes(5/3)         | Yes(6/3) | No(0)    | Yes(221/51) | Yes(4/2) |




















of the interesting observations during our study was that 
most infection attempts did not succeed even on com- 
pletely unpatched Windows 2000 and Windows XP sys- 
tems. As a result, a surprisingly small number of VM 
instances was sufficient to monitor the sources contact- 
ing the entire /17 network. The second component is the 
high-interaction-honeynet system, which is hosted in a 
high-performance Intel Xeon 3 GHz dual core, dual CPU 
system with 8 GB of memory. For the experiments listed 
in this paper, we typically ran the system with 9 Win- 
XP instances, 14 Windows 2000 instances (with two dif- 
ferent service pack levels), and 3 Linux FC3 instances. 
The system was moderately utilized under this load. The final
component is the DNS/DHCP server, which dynamically
assigns IP addresses to VMware instances and also
answers DNS queries from these hosts. 


Over a 3-week period between March and April 2007, 
we analyzed a total of 2,019 successful WinXP and 
Win2K remote-exploit bot or worm infections. Each 
malware infection instance succeeded in causing the hon- 
eypot to initiate outbound communications related to the 
infection. Through our analysis of these traces using 
BotHunter sensor logs, we were able to very reliably ob- 
serve the malware communications associated with the 
remote-to-local network service infection and the mal- 
ware binary acquisition (egg download). In many in- 
stances we also observed the infected honeypot proceed 
to establish C&C communications and attempt to propagate
to other victims in our honeynet. During part
of this experiment, our DNS service operated unreliably,
and some C&C coordination events were not observed
due to DNS lookup failures.

Figure 7 illustrates a sample infection that was de- 
tected using the SRI honeynet, and the corresponding 
BotHunter profile. W32/IRCBot-TO is a very recent (re- 
leased January 19, 2007) network worm/bot that propa- 
gates through open network shares and affects both Win- 
dows 2000 and Windows XP systems [37]. The worm 
uses the IPC share to connect to the pipe and 
leverages the MS06-40 exploit [27], which is a buffer 


overflow that enables attackers to craft RPC requests that 
can execute arbitrary code. This mechanism is used to 
force the victim to fetch and execute a binary named
netadp.exe from the system folder. The infected system
then connects to the z3net IRC network and joins two
channels, upon which it is instructed to initiate scans of
the 203.0.0.0/8 network on several ports. Other bot families
successfully detected by BotHunter included variants of 
W32/Korgo, W32/Virut.A and W32/Padobot. 

Overall, BotHunter detected a total of 1,920 of these 
2,019 successful bot infections. This represents a 95.1% 
true positive rate. All malware instances observed dur- 
ing this period transmitted their exploits through ports 
TCP-445 or TCP-139. This is very common behavior, as 
the malware we observe tends to exploit the first vulnerable
port that replies to a targeted scan, and ports TCP-445
and TCP-139 are usually among the first ports tested.
The infection set analyzed exhibited limited diversity in 
the infection transmission methods, and overall we ob- 
served roughly 40 unique patterns in the dialog warnings 
produced. 
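
The reported rates follow directly from the two counts given above. The short calculation below is a worked check of that arithmetic; the function name is illustrative.

```python
def detection_rates(successful_infections: int, detected: int) -> tuple[int, float, float]:
    """Return (missed count, true positive rate, false negative rate)."""
    missed = successful_infections - detected
    tp_rate = detected / successful_infections
    fn_rate = missed / successful_infections
    return missed, tp_rate, fn_rate


# Counts from the honeynet experiment: 2,019 infections, 1,920 detected.
missed, tp_rate, fn_rate = detection_rates(2019, 1920)
print(missed, f"{tp_rate:.1%}", f"{fn_rate:.1%}")
```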

This experiment produced 99 bot infections that did 
not produce bot profiles, i.e., a 4.9% false negative rate. 
To explain these occurrences we manually examined 
each bot infection trace that eluded BotHunter, using tcpdump
and Ethereal. The reasons for these failed bot detections
can be classified into three primary categories:
infection failures, honeynet setup or policy failures, or 
data corruption failures. 

• Infection failures: We observed infections in which
the exploit apparently led to instability and eventual failure
in the infected host. More commonly, we observed
cases in which the infected victim attempted to “phone
home,” but the SYN request received no reply.

• Honeynet setup and policy failures: We observed
that our NAT mechanism did not correctly translate
application-level address requests (e.g., FTP PORT commands).
This prevented several FTP egg download connection
requests from proceeding, which would have
otherwise led to egg download detections. In addition,
some traces were incomplete due to errors in our honeypot
recycling logic, which interfered with our observation
of the infection logic.

• Data corruption failures: Data corruption was the
dominant reason (86% of the failed traces) that prevented
our BotHunter sensors from producing dialog warnings.
We are still investigating the cause of these corruptions,
but suspect that they likely happened during log
rotations by the Drone manager.

Discussion: In addition to the above false negative experiences,
we also recognize that other reasons could
prevent BotHunter from detecting infections. A natural
extension of our infection failures is for a bot to purposely
lie dormant once it has infected a host, to avoid
association of the infection transmission with an outbound
egg download or coordination event. This strategy
could be used successfully to circumvent BotHunter deployed
with our default fixed pruning interval. While we
found that some infected victims failed to phone home, we
could also envision the egg download source eventually
responding to these requests after the BotHunter pruning
interval, causing a similar missed association. Sensor
coverage is of course another fundamental concern
for any detection mechanism. Finally, while these results
are highly encouraging, the honeynet environment
provided low diversity in bot infections, in which attention
was centered on direct exploits of TCP-445 and
TCP-139. We did not provide a diversity of honeypots
with various OSs, vulnerable services, or Trojan backdoors
enabled, to fully examine the behavioral complexities
of bots or worms.


5.3 An Example Detection in a Live Deployment


In addition to our laboratory and honeynet experiences, 
we have also fielded BotHunter to networks within the 
Georgia Tech campus network and within an SRI lab- 
oratory network. In the next sections we will discuss 
these deployments and our efforts to evaluate the false 
positive performance of BotHunter. First, we will briefly 
describe one example host infection that was detected us- 
ing BotHunter within our Georgia Tech campus network 
experiments. 

In early February 2007, BotHunter detected a bot infection
that produced E1, E4, and E5 dialog warnings.
Upon inspection of the bot profile, we observed that the
bot-infected machine was scanned, joined an IRC channel,
and began scanning other machines during the BotHunter
time window. One unusual element in this experience
was the omission of the actual infection transmission
event (E2), which is observed with high frequency
in our live honeynet testing environment. We assert that
the bot profile represents an actual infection because, during
our examination of this infection report, we discovered
that the target of the E4 (C&C server) dialog warning
was an address that was blacklisted both by the ShadowServer
and the botnet mailing list as a known C&C
server during the time of our bot profile.


5.4 Experiments in a University Campus Network 


In this experiment, we evaluate the detection and false 
positive performance of BotHunter in a production cam- 
pus network (at the College of Computing [CoC] at 
Georgia Tech). The time period of this evaluation was 
between October 2006 and February 2007. 

The monitored link exhibits typical diurnal behavior 
and a sustained peak traffic of over 100 Mbps during the 
day. While we were concerned that such traffic rates 
might overload typical NIDS rulesets and real-time de- 
tection systems, our experience shows that it is possible 
to run BotHunter live under such high traffic rates us- 
ing commodity PCs. Our BotHunter instance runs on 
a Linux server with an Intel Xeon 3.6 GHz CPU and 6 
GB of memory. The system runs with average CPU and 
memory utilization of 28% and 3%, respectively. 

To evaluate the representativeness of this traffic, we 
randomly sampled packets for analysis (about 40 min- 
utes). The packets in our sample, which were almost 
evenly distributed between TCP and UDP, demonstrated 
wide diversity in protocols, including popular protocols 
such as HTTP, SMTP, POP, FTP, SSH, DNS, and SNMP, 
and collaborative applications such as IM (e.g., ICQ, 
AIM), P2P (e.g., Gnutella, Edonkey, bittorrent), and 
IRC, which share similarities with infection dialog (e.g., 
two-way communication). We believe the high volume 
of background traffic, involving large numbers of hosts 
and a diverse application mix, offers an appealing en- 
vironment to confirm our detection performance, and to 
examine the false positive question. 

First, we evaluated the detection performance of 
BotHunter in the presence of significant background traf- 
fic. We injected bot traffic captured in the virtual network 
(from the experiments described in Section 5.1) into the 
captured Georgia Tech network traffic. Our motivation 
was to simulate real network infections for which we 
have the ground truth information. In these experiments, 
BotHunter correctly detected all 10 injected infections 
(by the 10 bots described in Section 5.1). 

Next, we conducted a longer-term (4 months) eval- 
uation of false alarm production. Table 4 summarizes 
the number of dialog warnings generated by BotHunter 
for each event type from October 2006 to January 2007. 
BotHunter sensors generated about 2,563,402 (more than 
20,000 per day) raw dialog warnings from all the five 
event categories. For example, many E3 dialog warnings
report on Windows executable downloads, which
by themselves do not shed light on the presence of exploitable
vulnerabilities. However, our experiments do
demonstrate that the bot detection conditions
outlined in Section 3 rarely align within a stream
of dialog warnings from normal traffic patterns. In fact,
only 98 profiles were generated in 4 months, less than
one per day on average.

Table 4: Raw Alerts of BotHunter in 4 Month Operation in CoC Network

| Event   | E1      | E2[rb]  | E2[sl]  | E3    | E4      | E5     |
| Alert # | 550,373 | 950,112 | 316,467 | 1,013 | 697,374 | 48,063 |
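
The aggregate figures quoted for the campus deployment can be rechecked from the per-class counts in Table 4. The sketch below is a worked calculation only; it assumes that the evaluation covered the full calendar months from October 2006 through January 2007, which the text does not state exactly.

```python
from datetime import date

# Raw alert counts per dialog class, copied from Table 4.
ALERTS = {
    "E1": 550_373,
    "E2[rb]": 950_112,
    "E2[sl]": 316_467,
    "E3": 1_013,
    "E4": 697_374,
    "E5": 48_063,
}
PROFILES = 98

# Assumption: four full calendar months, 1 October 2006 through 31 January 2007.
days = (date(2007, 2, 1) - date(2006, 10, 1)).days

total_alerts = sum(ALERTS.values())
alerts_per_day = total_alerts / days
profiles_per_day = PROFILES / days

print(total_alerts, days)
print(round(alerts_per_day), round(profiles_per_day, 2))
```

Under that assumption the per-class counts sum to the reported total of 2,563,402 raw warnings, the daily rate exceeds 20,000 warnings, and fewer than one profile is produced per day.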

In further analyzing these 98 profiles, we had the following
findings. First, there are no false positives related
to any normal usage of collaborative applications
such as P2P, IM, or IRC. Almost two-thirds (60) of the
bot profiles involved access to an MS-Exchange SMB
server (33) or an SMTP server (27). In the former case,
the bot profiles described a NETBIOS SMB-DS IPC$
unicode share access followed by a Windows executable
download event. In the latter case, Bleeding Edge Snort’s IRC
rules are sensitive to some IRC commands (e.g., USER) that
frequently appear in the SMTP header. These issues could
easily be mitigated by additional whitelisting of certain
alerts on these servers. The remaining profiles mainly
contained two event types and had low overall confidence
scores. Additional analysis of these incidents was complicated
by the lack of full packet traces in our high-speed
network. We can conservatively assume that they
are false positives, and thus our experiments here provide
a reasonable estimate of the upper bound on the number
of false alarms (less than one per day) in a busy campus network.


5.5 Experiments in an Institutional Laboratory 


We deployed BotHunter live on a small well- 
administered production network (a lightly used /17 net- 
work that we can say with high confidence is infection 
free). Here, we describe our results from running BotH- 
unter in this environment. Our motivation for conducting 
this experiment was to obtain experience with false posi- 
tive production in an operational environment, where we 
could also track all network traces and fully evaluate the 
conditions that may cause the production of any unex- 
pected bot profiles. 

BotHunter conducted a 10-day data stream monitor- 
ing test from the span port position of an egress border 
switch. The network consists of roughly 130 active IP 
addresses, an 85% Linux-based host population, and an 
active user base of approximately 54 people. During this 
period, 182 million packets were analyzed, consisting 
of 152 million TCP packets (83.5%), 15.8 million UDP 
packets (8.7%), and 14.1 million ICMP packets (7.7%). 
Our BotHunter sensors produced 5,501 dialog warnings, 
composed of 1,378 E1 scan events, 20 E2 exploit signature
events, 193 E3 egg-download signature events,
7 E4 C&C signature events and 3,904 E5 scan events. 
From these dialog warnings, the BotHunter correlator 


produced just one bot profile. Our subsequent analysis of 
the packets that caused the bot profile found that this was 
a false alarm. Upon packet inspection, it was found that 
the session for which the bot declaration occurred con- 
sisted of a 1.6 GB multifile FTP transfer, during which a 
binary image was transferred with content that matched 
one of our buffer overflow detection patterns. The buffer 
overflow false alarm was coupled with a second MS Win- 
dows binary download, which caused BotHunter to cross 
our detection threshold and declare a bot infection. 


6 Limitations and Future Work 


Several important practical considerations present chal- 
lenges in extending and adapting BotHunter for arbitrary 
networks in the future. 

Adapting to Emerging Threats and Adversaries: 
Network defense is a perennial arms race,⁷ and we anticipate
that the threat landscape could evolve in several
ways to evade BotHunter. First, bots could use encrypted 
communication channels for C&C. Second, they could 
adopt more stealthy scanning techniques. However, the 
fact remains that hundreds of thousands of systems re- 
main unprotected, attacks still happen in the clear, and 
adversaries have not been forced to innovate. Open- 
source systems such as BotHunter would raise the bar for 
successful infections. Moreover, BotHunter could be ex- 
tended with anomaly-based “entropy detectors” for iden- 
tification of encrypted channels. We have preliminary re- 
sults that are promising and defer deeper investigation to 
future work. We are also developing new anomaly-based 
C&C detection schemes (for E4). 
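
One simple form that an entropy-based detector could take is a byte-level Shannon entropy test on payloads. The sketch below is our own illustration of that general idea and is not the detector under development; the function names and the 7.5 bits-per-byte threshold are assumptions.

```python
import math
from collections import Counter


def byte_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not payload:
        return 0.0
    n = len(payload)
    counts = Counter(payload)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_encrypted(payload: bytes, threshold: float = 7.5) -> bool:
    """Flag payloads whose byte distribution is close to uniform.

    The threshold is an illustrative assumption. Entropy cannot exceed
    log2(len(payload)), so short payloads can never cross a high threshold.
    """
    return byte_entropy(payload) >= threshold


irc_like = b"USER bot bot bot :bot\r\nNICK bot123\r\nJOIN #channel\r\n"
uniform = bytes(range(256)) * 4  # stand-in for ciphertext-like data

print(looks_encrypted(irc_like), looks_encrypted(uniform))
```

Compressed content also has near-uniform byte statistics, so a test of this kind would need additional context to avoid flagging ordinary archive or media transfers.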

It is also conceivable that if BotHunter is widely de- 
ployed, adversaries would devise clever means to evade 
the system, e.g., by using attacks on BotHunter’s dia- 
log history timers. One countermeasure is to incorporate 
an additional random delay to the hard prune interval, 
thereby introducing uncertainty into how long BotHunter 
maintains local dialog histories. 
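
The randomized-pruning countermeasure can be sketched as follows. This is a minimal illustration, not BotHunter's code: the names `HARD_PRUNE_INTERVAL`, `MAX_JITTER`, and `next_prune_deadline`, and both numeric values, are hypothetical.

```python
import random

HARD_PRUNE_INTERVAL = 300.0  # seconds; illustrative value, not BotHunter's setting
MAX_JITTER = 60.0            # assumed upper bound on the extra random delay


def next_prune_deadline(now, rng=random):
    """Return the time at which a dialog history would be hard-pruned.

    The random term keeps an observer from learning exactly how long
    dialog state is retained.
    """
    return now + HARD_PRUNE_INTERVAL + rng.uniform(0.0, MAX_JITTER)
```

A deployment would presumably draw the delay from an unpredictable source such as `random.SystemRandom()` rather than the default generator.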

Incorporating Additional State Logic: The current 
set of states in the bot infection model was based on the 
behavior of contemporary bots. As bots evolve, it is con- 
ceivable that this set of states would have to be extended 
or otherwise modified to reflect the current threat land- 
scape. This could be accomplished with simple config- 
uration changes to the BotHunter correlator. We expect 
such changes to be fairly infrequent as they reflect fundamental
paradigm shifts in bot behavior.

7 In this race, we consider BotHunter to be a substantial technologi-
cal escalation for the white hats.

Figure 7: Honeynet Interaction Summary (left) and corresponding BotHunter Profile (right) for W32/IRCBot-TO


7 Related Work 


Recently, there has been a significant thrust in research 
on botnets. To date, the primary focus of much of this 
research has been on gaining a basic understanding of 
the nature and full potential of the botnet threat. Rajab et 
al. provided an in-depth study in understanding the dy- 
namics of botnet behavior in the large, employing “lon- 
gitudinal tracking” of IRC botnets through IRC and DNS 
tracking techniques [33]. Researchers have also studied 
the dynamics of botnet C&C protocols [19, 50], includ- 
ing global dynamics such as diurnal behavior [14]. Other 
studies have investigated the internals of bot instances to 
examine the structural similarities, defense mechanisms, 
and command and control capabilities of the major bot 
families [7] and developed techniques to automatically 
harvest malware samples directly from the Internet [6]. 
There is also some very recent work on the detection of 
botnets. Rishi [21] is an IRC botnet detection system 
that uses n-gram analysis to identify botnet nickname 
patterns. Binkley and Singh [9] proposed an
anomaly-based system that combines IRC statistics and TCP work
weight for detecting IRC-based botnets. Livadas et al.
[26] proposed a machine-learning-based approach for
botnet detection. Karasaridis et al. [25] presented an
algorithm for detecting IRC botnet controllers from net- 
flow records. These efforts are complementary in that 
they could provide additional BotHunter evidence-trails 
for infection events. 


A significant amount of related work has investigated 
alert correlation techniques for network intrusion detec- 
tion. An approach to capturing complex and multistep 
attacks is to explicitly specify the stages, relationships, 
and ordering among the various constituents of an at- 
tack. GriDS [39] aggregates network activity into ac- 
tivity graphs that can be used for analyzing causal struc- 
tures and identifying policy violations. CARDS is a dis- 
tributed system for detecting and mitigating coordinated 
attacks [49]. Abad et al. [2] proposed to correlate data 
among different sources/logs (e.g., syslog, firewall, netflow)
to improve intrusion detection system accuracy. Ellis
et al. and Jiang et al. describe two behavioral-based
systems for detecting network worms [17, 23]. In con- 
trast to the above systems, our work focuses on the prob- 
lem of bot detection and uses infection dialog correlation 
as a means to define the probable set of events that indi- 
cate a bot infection. 


Sommer et al. [36] describe contextual Bro signa- 
tures as a means for producing expressive signatures 
and weeding out false positives. These signatures cap- 
ture two dialogs and are capable of precisely defin- 
ing multistep attacks. Our work differs from this in 
our requirement to simultaneously monitor several flows 
across many participants (e.g., infection source, bot vic- 
tim, C&C, propagation targets) and our evidence-trail- 
based approach to loosely specify bot infections. 


JIGSAW is a system that uses notions of concepts
and capabilities for modeling complex attacks [40],
and [29] provides a formal framework for alert correla-
tion. CAML is a language framework for defining and
detecting multistep attack scenarios [11]. Unlike BotH-
unter, all these systems are based on causal relationships,
i.e., pre-conditions and post-conditions of attacks. An
obvious limitation is that these dependencies need to be
manually specified a priori for all attacks, and yet such
dependencies are often unknown.

Alert correlation modules such as CRIM [13] provide 
the ability to cluster and correlate similar alerts. The sys- 
tem has the capability to extract higher-level correlation 
rules automatically for the purpose of intention recog- 
nition. In [42], Valdes and Skinner propose a two-step 
probabilistic alert correlation based on attack threads and 
alert fusion. We consider this line of work to be comple- 
mentary, i.e., these fusion techniques could be integrated 
into the BotHunter framework as a preprocessing step in 
a multisensor environment. 

USTAT [22] and NetSTAT [44] are two IDSs based on 
state transition analysis techniques. They specify com- 
puter attacks as sequences of actions that cause transi- 
tions in the security state of a system. In [43], multistep 
attack correlation is performed on attack scenarios spec- 
ified (a priori) using STATL [16], which is a language 
for expressing attacks as states and transitions. Our work 
differs from these systems in that we do not have a strict 
requirement of temporal sequence, and can tolerate miss- 
ing events during the infection flow. 


8 BotHunter Internet Distribution 


We are making BotHunter available as a free Internet dis- 
tribution for use in testing and facilitating research with 
the hope that this initiative would stimulate community 
development of extensions. 

A key component of the BotHunter distribution is the 
Java-based correlator that by default reads alert streams 
from Snort. We have tested our system with Snort 2.6.* 
and it can be downloaded from www.cyber-ta.org/botHunter/. A 
noteworthy feature of the distribution is integrated sup- 
port for “large-scale privacy-preserving data sharing”. 
Users can enable an option to deliver secure anonymous 
bot profiles to the Cyber-TA security repository [32], the 
collection of which we will make available to providers 
and researchers. The repository is currently operational 
and in beta release of its first report delivery software. 

Our envisioned access model is similar to that of 
DShield.org [41] with the following important differ- 
ences. First, our repository is blind to who is submit- 
ting the bot report and the system will deliver alerts via 
TLS over TOR, preventing an association of bot reports 
to a site via passive sniffing. Second, our anonymiza- 
tion strategy obfuscates all local IP addresses and time 
intervals in the profile database but preserves C&C, egg 
download, and attacker addresses that do not match a
user-defined address proximity mask. Users can enable
further field anonymizations as they require. We intend to
use contributed bot profiles to learn specific alert signa- 
ture patterns for specific bots, to track attackers, and to 
identify C&C sites. 


9 Conclusion 


We have presented the design and implementation of 
BotHunter, a perimeter monitoring system for real-time 
detection of Internet malware infections. The corner- 
stone of the BotHunter system is a three-sensor dialog 
correlation engine that performs alert consolidation and 
evidence trail gathering for investigation of putative in- 
fections. We evaluate the system’s detection capabili- 
ties in an in situ virtual network and a live honeynet 
demonstrating that the system is capable of accurately 
flagging both well-studied and emergent bots. We also 
validate low false positive rates by running the system 
live in two operational production networks. Our ex- 
perience demonstrates that the system is highly scalable 
and reliable (very low false positive rates) even with not- 
so-reliable (weak) raw detectors. BotHunter is also the 
first example of a widely distributed bot infection pro- 
file analysis tool. We hope that our Internet release will 
enable the community to extend and maintain this capa- 
bility while inspiring new research directions. 
Integrity Checking in Cryptographic File Systems
with Constant Trusted Storage 


Alina Oprea* 


Abstract 


In this paper we propose two new constructions for pro- 
tecting the integrity of files in cryptographic file systems. 
Our constructions are designed to exploit two charac- 
teristics of many file-system workloads, namely low en- 
tropy of file contents and high sequentiality of file block 
writes. At the same time, our approaches maintain the 
best features of the most commonly used algorithm to- 
day (Merkle trees), including defense against replay of 
stale (previously overwritten) blocks and a small, con- 
stant amount of trusted storage per file. Via implementa- 
tions in the EncFS cryptographic file system, we evalu- 
ate the performance and storage requirements of our new 
constructions compared to those of Merkle trees. We 
conclude with guidelines for choosing the best integrity 
algorithm depending on typical application workload. 


1 Introduction 


The growth of outsourced storage in the form of storage 
service providers underlines the importance of develop- 
ing efficient security mechanisms to protect files stored 
remotely. Cryptographic file systems (e.g., [10, 6, 25, 
13, 17, 23, 20]) provide means to protect file secrecy (i.e., 
prevent leakage of file contents) and integrity (i.e., detect 
the unauthorized modification of file contents) against 
the compromise of the file store and attacks on the net- 
work while blocks are in transit to/from the file store. 
Several engineering goals have emerged to guide the de- 
sign of efficient cryptographic file systems. First, crypto- 
graphic protections should be applied at the granularity 
of individual blocks as opposed to entire files, since the 
latter requires the entire file to be retrieved to verify its
integrity, for example. Second, applying cryptographic 
protections to a block should not increase the block size, 
so as to be transparent to the underlying block store. 
(Cryptographic protections might increase the number of 
blocks, however.) Third, the trusted storage required by 
clients (e.g., for encryption keys and integrity verifica- 
tion information) should be kept to a minimum. 

In this paper we propose and evaluate two new algo- 
rithms for protecting file integrity in cryptographic file 
systems. Our algorithms meet these design goals, and 
in particular implement integrity using only a small con- 
stant amount of trusted storage per file. (Of course, as 
with any integrity-protection scheme, this trusted infor- 
mation for many files could itself be written to a file 
in the cryptographic file system, thereby reducing the 
trusted storage costs for many files to that of only one. 
The need for trusted information cannot be entirely elim- 
inated, however.) In addition, our algorithms exploit two 
properties of many file-system workloads to achieve effi- 
ciencies over prior proposals. First, typical file contents 
in many file-system workloads have low empirical en- 
tropy; such is the case with text files, for example. Our 
first algorithm builds on our prior proposal that exploits 
this property [26] and uses tweakable ciphers [21, 15] 
for encrypting file block contents; this prior proposal, 
however, did not achieve constant trusted storage per file. 
Our second algorithm reduces the amount of additional 
storage needed for integrity by using the fact that low- 
entropy block contents can be compressed enough to em- 
bed a message-authentication code inside the block. The 
second property that we exploit in our algorithms to re- 
duce the additional storage needed for integrity is that 
blocks of the same file are often written sequentially, a 
characteristic that, to our knowledge, has not been previ- 
ously utilized. 
By designing integrity mechanisms that exploit these
properties, we demonstrate more efficient integrity pro- 
tections in cryptographic file systems than have previ- 
ously been possible for many workloads. The measures 
of efficiency that we consider include the amount of un- 
trusted storage required by the integrity mechanism (over 
and above that required for file blocks); the integrity 
bandwidth, i.e., the amount of this information that must 
be accessed (updated or read) when accessing a single 
file block, averaged over all blocks in a file, all blocks in 
all files, or all accesses in a trace (depending on context); 
and the file write and read performance costs. 

The standard against which we compare our algo-
rithms is the Merkle tree [24], which to date is by far
the most popular method of integrity protection
for a file. Merkle trees can be implemented in crypto- 
graphic file systems so as to meet the requirements out- 
lined above, in particular requiring trusted storage per file 
of only one output of a cryptographic hash function (e.g., 
20 bytes for SHA-1 [30]). They additionally offer an in- 
tegrity bandwidth per file that is logarithmic in the num- 
ber of file blocks. However, Merkle trees are oblivious 
to file block contents and access characteristics, and we 
show that by exploiting these, we can generate far more 
efficient integrity mechanisms for some workloads. 

We have implemented our integrity constructions and 
Merkle trees in EncFS [14], an open-source user-level 
file system that transparently provides file block encryp- 
tion on top of FUSE [12]. We provide an evaluation of 
the three approaches with respect to our measures of in- 
terest, demonstrating how file contents, as well as file ac- 
cess patterns, have a great influence on the performance 
of the new integrity algorithms. Our experiments demon- 
strate that there is not a clear winner among the three 
constructions for all workloads, in that different integrity 
constructions are best suited to particular workloads. We 
thus conclude that a cryptographic file system should im- 
plement all three schemes and give higher-level applica- 
tions an option to choose the appropriate integrity mech- 
anism. 


2 Random Access Integrity Model 


We consider the model of a cryptographic file system that 
provides random access to files. Encrypted data is stored 
on untrusted storage servers and there is a mechanism 
for distributing the cryptographic keys to authorized par- 
ties. A small (on the order of several hundred bytes), 
fixed-size per file, trusted storage is available for authen- 
tication data. 

We assume that the storage servers are actively con-
trolled by an adversary. The adversary can adaptively
alter the data stored on the storage servers or perform
any other attack on the stored data, but it cannot modify 
or observe the trusted storage. A particularly interesting 
attack that the adversary can mount is a replay attack, in 
which stale data is returned to read requests of clients. 
Using the trusted storage to keep some constant-size in- 
formation per file, and keeping more information per file 
on untrusted storage, our goal is to design and evaluate 
integrity algorithms that allow the update and verification 
of individual blocks in files and that detect data modifi- 
cation and replay attacks. 

In our framework, a file F is divided into n fixed-size
blocks $B_1, B_2, \ldots, B_n$ (the last block $B_n$ might be shorter
than the first n − 1 blocks), each encrypted individually
with the encryption key of the file and stored on the un-
trusted storage servers (n differs per file). The constant-
size, trusted storage for file F is denoted $TS_F$. Addi-
tional storage for file F, which can reside in untrusted
storage, is denoted $US_F$; of course, $US_F$ can be written
to the untrusted storage server.

The storage interface provides two basic operations to
the clients: F.WriteBlock(i, C) stores content C at block
index i in file F and C ← F.ReadBlock(i) reads (en-
crypted) content from block index i in file F. An in-
tegrity algorithm for an encrypted file system consists of
five operations. In the initialization algorithm Init for
file F, the encryption key for the file is generated. In an
update operation Update(i, B) for file F, an authorized
client updates the i-th block in the file with the encryp-
tion of block content B and updates the integrity infor-
mation for the i-th block stored in $TS_F$ and $US_F$. In
the check operation Check(i, C) for file F, an authorized
client first decrypts C and then checks that the decrypted
block content is authentic, using the additional storage
$TS_F$ and $US_F$ for file F. The check operation returns
the decrypted block if it concludes that the block content
is authentic and ⊥ otherwise. A client can additionally
perform an append operation Append(B) for file F, in
which a new block that contains the encryption of B is
appended to the encrypted file, and a Delete operation
that deletes the last block in a file and updates the in-
tegrity information for the file.

Using the algorithms we have defined for an integrity 
scheme for an encrypted file, a client can read or write at 
any byte offset in the file. For example, to write to a byte 
offset that is not at a block boundary, the client first reads 
the block to which the byte offset belongs, decrypts it 
and checks its integrity using algorithm Check. Then, the 
client constructs the new data block by replacing the ap- 
propriate bytes in the decrypted block, and calls Update 
to encrypt the new block and compute its integrity infor- 
mation. 
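
The read-modify-write sequence for an unaligned write can be sketched as follows. This is only an illustration of the control flow: `ToyFile` is a hypothetical in-memory stand-in whose "encryption" is the identity and whose integrity information is a per-block SHA-256 digest, and `None` stands in for the failure symbol returned by Check. The tiny block size is likewise an assumption chosen for readability.

```python
import hashlib

BLOCK_SIZE = 8  # tiny block size so the example is easy to follow


class ToyFile:
    """Hypothetical stand-in that only mimics the Check/Update interface.

    Blocks are stored unencrypted and integrity is a per-block SHA-256
    digest, which is not one of the constructions in this paper.
    """

    def __init__(self, blocks):
        self.blocks = list(blocks)  # plays the role of untrusted storage
        self.digests = [hashlib.sha256(b).digest() for b in self.blocks]

    def read_block(self, i):
        return self.blocks[i]

    def check(self, i, c):
        # Return the plaintext block, or None if it is not authentic.
        return c if hashlib.sha256(c).digest() == self.digests[i] else None

    def update(self, i, b):
        self.blocks[i] = b
        self.digests[i] = hashlib.sha256(b).digest()


def write_at_offset(f, offset, data):
    """Write `data` at byte `offset`; the write must fit in one block."""
    i, start = divmod(offset, BLOCK_SIZE)
    if start + len(data) > BLOCK_SIZE:
        raise ValueError("this sketch handles writes within one block only")
    plain = f.check(i, f.read_block(i))  # read, decrypt, verify
    if plain is None:
        raise IOError("integrity check failed for block %d" % i)
    new_block = plain[:start] + data + plain[start + len(data):]
    f.update(i, new_block)  # re-encrypt and refresh integrity information
```

For example, with two 8-byte blocks, `write_at_offset(f, 10, b"xy")` touches only the second block and replaces its third and fourth bytes.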
In designing an integrity algorithm for a cryptographic
file system, we consider the following metrics. First is
the size of the untrusted storage $US_F$; we will always
enforce that the trusted storage $TS_F$ is of constant size,
independent of the number of blocks. Second is the in-
tegrity bandwidth for updating and checking individual
file blocks, defined as the number of bytes from $US_F$ ac-
cessed (updated or read) when accessing a block of file
F, averaged over either: all blocks in F when we speak
of a per-file integrity bandwidth; all blocks in all files
when we speak of the integrity bandwidth of the file sys-
tem; or all blocks accessed in a particular trace when we
speak of one trace. Third is the performance cost of writ-
ing and reading files.


3 Preliminaries 


3.1 Merkle Trees 


Merkle trees [24] are used to authenticate n data items
with constant-size trusted storage. A Merkle tree for
data items $M_1, \ldots, M_n$, denoted $MT(M_1, \ldots, M_n)$, is
a binary tree that has $M_1, \ldots, M_n$ as leaves. An inte-
rior node of the tree with children $C_L$ and $C_R$ is the hash
of the concatenation of its children (i.e., $h(C_L \| C_R)$, for
$h : \{0,1\}^* \to \{0,1\}^s$ a second preimage resistant hash
function [29] that outputs strings of length s bits). If the
root of the tree is stored in trusted storage, then all the
leaves of the tree can be authenticated by reading from
the tree a number of hashes logarithmic in n.

We define the Merkle tree for a file F with n blocks
$B_1, \ldots, B_n$ to be the binary tree
$MT_F = MT(h(1 \| B_1), \ldots, h(n \| B_n))$. A Merkle tree with a given set of leaves
can be constructed in multiple ways. We choose to ap-
pend a new block in the tree as a right-most child, so
that the tree has the property that all the left subtrees are
complete. We define several algorithms for a Merkle tree
T, for which we omit the implementation details, due to
space limitations.

- In the UpdateTree(R, i, hval) algorithm for tree T,
the hash stored at the i-th leaf of T (counting from left
to right) is updated to hval. This triggers an update of
all the hashes stored on the path from the i-th leaf to the
root of the tree. It is necessary to first check that all the
siblings of the nodes on the path from the updated leaf
to the root of the tree are authentic. Finally, the updated
root of the tree is output in R.

- The CheckTree(R, i, hval) algorithm for tree T
checks that the hash stored at the i-th leaf matches hval.
All the hashes stored at the nodes on the path from the
i-th leaf to the root are computed and the root of T is
checked finally to match the value stored in R.

- Algorithm AppendTree(R, hval) for tree T appends
a new leaf u that stores the hash value hval to the tree,
updates the path from this new leaf to the root of the tree
and outputs the new root of the tree in R.

- The DeleteTree(R) algorithm for tree T deletes the
last leaf from the tree, updates the remaining path to the
root of the tree and outputs the new root of the tree in R.
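
Since the implementation details are omitted, the following sketch shows one way the check operation could work. It is an illustration under stated assumptions, not the paper's implementation: it uses SHA-256 as the hash, an 8-byte big-endian block index, and a tree shape in which an unpaired node is promoted unchanged to the next level. The helper names `leaf`, `build_levels`, `auth_path`, and `check_tree` are hypothetical.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def leaf(i, block):
    # Leaf for block i is h(i || B_i); the 8-byte index encoding is an assumption.
    return h(i.to_bytes(8, "big") + block)


def build_levels(leaves):
    """Return all tree levels, from the leaves up to the single root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur, nxt = levels[-1], []
        for j in range(0, len(cur), 2):
            if j + 1 < len(cur):
                nxt.append(h(cur[j] + cur[j + 1]))
            else:
                nxt.append(cur[j])  # unpaired node is promoted unchanged
        levels.append(nxt)
    return levels


def auth_path(levels, index):
    """Sibling hashes needed to recompute the root from leaf `index` (0-based)."""
    path = []
    for level in levels[:-1]:
        sib = index ^ 1
        if sib < len(level):
            path.append((level[sib], sib < index))  # (hash, sibling is on the left)
        index //= 2
    return path


def check_tree(root, hval, path):
    """Recompute the root from a leaf hash and its path; compare with the trusted root."""
    node = hval
    for sib, sib_is_left in path:
        node = h(sib + node) if sib_is_left else h(node + sib)
    return node == root
```

Only `root` needs to live in trusted storage; the path holds at most one hash per tree level, which is where the logarithmic integrity bandwidth comes from.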


3.2 Encryption Schemes and Tweakable 
Ciphers 


An encryption scheme consists of a key generation algo-
rithm Gen that outputs an encryption key, an encryption
algorithm $E_k(M)$ that outputs the encryption of a mes-
sage M with secret key k and a decryption algorithm
$D_k(C)$ that outputs the decryption of a ciphertext C with
secret key k. A widely used secure encryption scheme is
AES [2] in CBC mode [8].

A tweakable cipher [21, 15] is, informally, a length-
preserving encryption method that uses a tweak in both
the encryption and decryption algorithms for variability.
A tweakable encryption of a message M with tweak t
and secret key k is denoted $E_k^t(M)$ and, similarly, the
decryption of ciphertext C with tweak t and secret key
k is denoted $D_k^t(C)$. The tweak is a public parameter,
and the security of the tweakable cipher is based only on
the secrecy of the encryption key. Tweakable ciphers can
be used to encrypt fixed-size blocks written to disk in a
file system. Suitable values of the tweak for this case
are, for example, block addresses or block indices in the
file. There is a distinction between narrow-block tweak-
able ciphers that operate on block lengths of 128 bits (as
regular block ciphers) and wide-block tweakable ciphers
that operate on arbitrarily large blocks (e.g., 512 bytes or
4KB). In this paper we use the term tweakable ciphers
to refer to wide-block tweakable ciphers as defined by
Halevi and Rogaway [15].

The security of tweakable ciphers implies an interest- 
ing property, called non-malleability [15], that guaran- 
tees that if only a single bit is changed in a valid cipher- 
text, then its decryption is indistinguishable from a ran- 
dom plaintext. Tweakable cipher constructions include 
CMC [15] and EME [16]. 


3.3 Efficient Block Integrity Using Ran- 
domness of Block Contents 


Oprea et al. [26] provide an efficient integrity construc- 
tion in a block-level storage system. This integrity con- 
struction is based on the experimental observation that 
contents of blocks written to disk usually are efficiently 
distinguishable from random blocks, i.e., blocks uni-
formly chosen at random from the set of all blocks of
a fixed length. Assuming that data blocks are encrypted 
with a tweakable cipher, the integrity of the blocks that 
are efficiently distinguishable from random blocks can be 
checked by performing a randomness test on the block 
contents. The non-malleability property of tweakable 
ciphers implies that if block contents after decryption 
are distinguishable from random, then it is very likely 
that the contents are authentic. This idea permits a re- 
duction in the trusted storage needed for checking block 
integrity: a hash is stored only for those (usually few) 
blocks that are indistinguishable from random blocks (or, 
in short, random-looking blocks). 

An example of a statistical test IsRand [26] that 
can be used to distinguish block contents from ran- 
dom blocks evaluates the entropy of a block and con- 
siders random those blocks that have an entropy higher 
than a threshold chosen experimentally. For a block B, 
IsRand(B) returns 1 with high probability if B is a uni-
formly random block in the block space and 0, other-
wise. Oprea et al. [26] provide an upper bound on the
false negative rate of the randomness test that is used in 
the security analysis of the scheme. 
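
An entropy-based test of this kind can be sketched as follows. The byte-level Shannon entropy estimate and the threshold value of 7.0 bits per byte are assumptions made for illustration; the test in [26] and its experimentally chosen threshold may be defined differently.

```python
import math
from collections import Counter

THRESHOLD = 7.0  # bits per byte; illustrative, not the threshold from [26]


def byte_entropy(block):
    """Empirical Shannon entropy of the byte distribution, in bits per byte."""
    n = len(block)
    if n == 0:
        return 0.0
    counts = Counter(block)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def is_rand(block, threshold=THRESHOLD):
    """Return 1 if the block looks random (high entropy), 0 otherwise."""
    return 1 if byte_entropy(block) > threshold else 0
```

Under this sketch, a 4 KB block of English text is classified as non-random and needs no stored hash, while a block of uniformly random bytes is classified as random.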

We use the ideas from Oprea et al. [26] as a start-
ing point for our first algorithm for implementing file
integrity in cryptographic file systems. The main chal-
lenge in constructing integrity algorithms in our model is to
efficiently reduce the amount of trusted storage per file
to a constant value. Our second algorithm also exploits
the redundancy in file contents to reduce the additional
space for integrity, but in a different way, by embedding
a message authentication code (MAC) in file blocks that
can be compressed enough. Both of these schemes build
on a novel technique, described in Section 4.1, for
efficiently tracking the number of writes to file blocks.


4 Write Counters for File Blocks 


All the integrity constructions for encrypted storage de- 
scribed in the next section use write counters for the 
blocks in a file. A write counter for a block denotes the 
total number of writes done to that block index. Coun-
ters are used to reduce the additional storage space taken
by encrypting with a block cipher in CBC mode, as de-
scribed in Section 4.2. Counters are also a means of dis-
tinguishing different writes performed to the same block
address and, as such, can be used to protect against replay
attacks.

We define several operations for the write counters of
the blocks in a file F. The UpdateCtr(i) algorithm ei-
ther initializes the value of the counter for the i-th block
in file F with 1, or it increments the counter for the i-
th block if it has already been initialized. The algorithm
also updates the information for the counters stored in
$US_F$. Function GetCtr(i) returns the value of the counter
for the i-th block in file F. When counters are used to
protect against replay attacks, they need to be authenti-
cated with a small amount of trusted storage. For au-
thenticating block write counters, we define an algorithm
AuthCtr that modifies the trusted storage space $TS_F$ of
file F to contain the trusted authentication information
for the write counters of F, and a function CheckCtr that
checks the authenticity of the counters stored in $US_F$ us-
ing the trusted storage $TS_F$ for file F and returns true
if the counters are authentic and false, otherwise. Both
operations for authenticating counters are invoked by an
authorized client.


4.1 Storage and Authentication of Block 
Write Counters 


A problem that needs to be addressed in the design of the 
various integrity algorithms described below is the stor- 
age and authentication of the block write counters. If a 
counter per file block were used, this would result in sig- 
nificant additional storage for counters. Here we propose 
a more efficient method of storing the block write coun- 
ters, based on analyzing the file access patterns in NFS 
traces collected at Harvard University [9]. 


Counter intervals. We performed experiments on the
NFS Harvard traces [9] in order to analyze the file access
patterns. We considered three different traces (LAIR,
DEASNA and HOME02) for a period of one week. The
LAIR trace consists of research workload traces from
Harvard’s computer science department. The DEASNA
trace is a mix of research and email workloads from the
division of engineering and applied sciences at Harvard.
HOME02 is mostly the email workload from the campus
general purpose servers.

Ellard et al. [9] make the observation that a large num- 
ber of file accesses are sequential. This leads to the idea 
that the values of the write counters for adjacent blocks 
in a file might be correlated. To test this hypothesis, we 
represent counters for blocks in a file using counter in- 
tervals. A counter interval is defined as a sequence of 
consecutive blocks in a file that all share the same value 
of the write counter. For a counter interval, we need to 
store only the beginning and end of the interval, and the 
value of the write counter. 
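
As a concrete illustration of the definition, the following sketch collapses a list of per-block write counters into counter intervals. The function name `to_intervals` and the tuple layout (start index, end index, counter value) are hypothetical choices, and block indices are 0-based here.

```python
def to_intervals(counters):
    """Collapse per-block write counters into (start, end, value) runs."""
    intervals = []
    for i, value in enumerate(counters):
        if intervals and intervals[-1][2] == value:
            start, _, _ = intervals[-1]
            intervals[-1] = (start, i, value)  # extend the current run
        else:
            intervals.append((i, i, value))    # start a new run
    return intervals
```

A file written sequentially a few times yields very few runs: six blocks with counters 3, 3, 3, 1, 1, 2 collapse to three intervals, and a file whose blocks were all written once collapses to one.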

Table 1 shows the average storage per file used by the 
two counter representation methods for the three traces. 
We represent a counter using 2 bytes (as the maximum 
observed value of a counter was 9905) and we repre-
sent file block indices with 4 bytes. The counter inter-
val method reduces the average storage needed for coun-
ters by a factor of 30 for the LAIR trace, 26.5 for the
DEASNA trace and 7.66 for the HOME02 trace com-
pared to the method that stores a counter per file block.
This justifies our design choice to use counter inter- 
vals for representing counter values in the integrity al- 
gorithms presented in the next section. 


























                      LAIR           DEASNA         HOME02
Counter per block     547.8 bytes    1.46 KB        3.16 KB
Counter intervals     18.35 bytes    55.04 bytes    413.44 bytes

Table 1: Average storage per file for two counter representation methods.


Counter representation. The counter intervals for file
F are represented by two arrays: $\mathsf{IntStart}_F$ keeps the
block indices where new counter intervals start and
$\mathsf{CtrVal}_F$ keeps the values of the write counter for each
interval. The trusted storage $TS_F$ for file F includes ei-
ther the arrays $\mathsf{IntStart}_F$ and $\mathsf{CtrVal}_F$ if they fit into $TS_F$
or, for each array, a hash of all its elements (concate-
nated), otherwise. In the limit, to reduce the bandwidth
for integrity, we could build a Merkle tree to authenticate
each of these arrays and store the root of these trees in
$TS_F$, but we have not seen in the Harvard traces files that
would warrant this. We omit here the implementation de-
tails for the UpdateCtr, GetCtr, AuthCtr and CheckCtr
operations on counters, due to space limitations.
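
Because those details are omitted, the following is only a plausible sketch of the lookup and authentication operations over the two arrays, not the authors' implementation. It assumes 0-based block indices, SHA-256 as the hash, big-endian 4-byte indices and 2-byte counters, and a counter value of 0 for blocks that were never written; the function names are hypothetical.

```python
import hashlib
from bisect import bisect_right


def get_ctr(int_start, ctr_val, i):
    """Counter of block i: value of the last interval starting at or before i."""
    k = bisect_right(int_start, i) - 1
    return ctr_val[k] if k >= 0 else 0  # 0 = never written (an assumption)


def array_digest(values, width):
    """Hash of all array elements concatenated, each encoded in `width` bytes."""
    return hashlib.sha256(b"".join(v.to_bytes(width, "big") for v in values)).digest()


def auth_ctr(int_start, ctr_val):
    # One digest per array is what would be kept in trusted storage.
    return array_digest(int_start, 4), array_digest(ctr_val, 2)


def check_ctr(trusted, int_start, ctr_val):
    """True if the untrusted arrays match the digests held in trusted storage."""
    return trusted == auth_ctr(int_start, ctr_val)
```

The binary search keeps a lookup logarithmic in the number of intervals, and the trusted state stays at two fixed-size digests regardless of how many intervals a file has.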

If the counter intervals for a file get too dispersed, then
the size of the arrays $\mathsf{IntStart}_F$ and $\mathsf{CtrVal}_F$ might in-
crease significantly. To keep the untrusted storage for
integrity low, we could periodically change the encryp-
tion key for the file, re-encrypt all blocks in the file, and
reset the block write counters to 0.


4.2 Length-Preserving Stateful Encryption 
with Counters 


Secure encryption schemes are usually not length- 
preserving. However, one of our design goals stated in 
the introduction is to add security (and, in particular, en- 
cryption) to file systems in a manner transparent to the 
storage servers. For this purpose, we introduce here the 
notion of a length-preserving stateful encryption scheme 
for a file F’, an encryption scheme that encrypts blocks in 
a way that preserves the length of the original blocks, and 
stores any additional information in the untrusted storage 
space for the file. We define a length-preserving stateful 
encryption scheme for a file F’ to consist of a key gen- 
eration algorithm G'°" that generates an encryption key 
for the file, an encryption algorithm E'®" that encrypts 


block content B for block index 2 with key k and out- 
puts ciphertext C’, and a decryption algorithm D'*" that 
decrypts the encrypted content C’ of block i with key k 
and outputs the plaintext B. Both the E'®" and D'°" al- 
gorithms also modify the untrusted storage space for the 
file. 

Tweakable ciphers are by definition length-preserving
stateful encryption schemes. A different construction, on
which we elaborate below, uses write counters for file
blocks. Let (Gen, E, D) be an encryption scheme constructed
from a block cipher in CBC mode. To encrypt an n-block
message in the CBC encryption mode, a random initialization
vector is chosen. The ciphertext consists of n + 1 blocks,
with the first being the initialization vector. We denote by
E_k(B, iv) the output of the encryption of B (excluding the
initialization vector) using key k and initialization vector
iv, and similarly by D_k(C, iv) the decryption of C using
key k and initialization vector iv.

We replace the random initialization vectors for en- 
crypting a file block with a pseudorandom function ap- 
plication of the block index concatenated with the write 
counter for the block. This is intuitively secure because 
different initialization vectors are used for different en- 
cryptions of the same block, and moreover, the proper- 
ties of pseudorandom functions imply that the initializa- 
tion vectors are indistinguishable from random. It is thus 
enough to store the write counters for the blocks of a file, 
and the initialization vectors for the file blocks can be 
easily inferred. 
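
As an illustration of the paragraph above, the following minimal sketch
derives an initialization vector from the block index and the write
counter. It is not the implementation evaluated in this paper: using
HMAC-SHA256 as the pseudorandom function, the 8-byte big-endian
encodings and the name derive_iv are all assumptions made for the
example.

```python
import hashlib
import hmac

BLOCK_BYTES = 16  # AES block size; the IV must have this length


def derive_iv(prf_key: bytes, block_index: int, write_counter: int) -> bytes:
    """Return PRF_k(index || counter), truncated to one cipher block.

    HMAC-SHA256 stands in for the PRF here (an assumption of this sketch).
    """
    # Fixed-width encodings keep (index, counter) pairs unambiguous.
    message = block_index.to_bytes(8, "big") + write_counter.to_bytes(8, "big")
    return hmac.new(prf_key, message, hashlib.sha256).digest()[:BLOCK_BYTES]


key = bytes(range(32))  # hypothetical PRF key, for demonstration only
iv_first_write = derive_iv(key, block_index=7, write_counter=1)
iv_second_write = derive_iv(key, block_index=7, write_counter=2)
```

Only the counter has to be stored: the same (index, counter) pair always
yields the same IV, while a rewrite of the block increments the counter
and therefore changes the IV.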

The G^len, E^len and D^len algorithms for a file F are
described in Figure 1. Here PRF : K_PRF × I → B denotes a
pseudorandom function family with key space K_PRF, input
space I (i.e., the set of all block indices concatenated
with block counter values), and output space B (i.e., the
block space of E).


5 Integrity Constructions for Encrypted 
Storage 


In this section, we first present a Merkle tree integrity 
construction for encrypted storage, used in file systems 
such as Cepheus [10], FARSITE [1], and Plutus [17]. 
Second, we introduce a new integrity construction based 
on tweakable ciphers that uses some ideas from Oprea 
et al. [26]. Third, we give a new construction based on 
compression levels of block contents. We evaluate the 
performance of the integrity algorithms described here 
in Section 7.

G^len(F):
    k1 ←R K_PRF
    k2 ← Gen()
    return <k1, k2>

E^len_<k1,k2>(F, i, B):
    F.UpdateCtr(i)
    iv ← PRF_k1(i || F.GetCtr(i))
    C ← E_k2(B, iv)
    return C

D^len_<k1,k2>(F, i, C):
    iv ← PRF_k1(i || F.GetCtr(i))
    B ← D_k2(C, iv)
    return B

Figure 1: Implementing a length-preserving stateful encryption scheme with write counters.





F.Update(i, B):
    k ← F.enc_key
    MT_F.UpdateTree(TS_F, i, h(i||B))
    C ← E^len_k(F, i, B)
    F.WriteBlock(i, C)

F.Check(i, C):
    k ← F.enc_key
    B_i ← D^len_k(F, i, C)
    if MT_F.CheckTree(TS_F, i, h(i||B_i)) = true
        return B_i
    else
        return ⊥

F.Append(B):
    k ← F.enc_key
    n ← F.blocks
    MT_F.AppendTree(TS_F, h(n + 1||B))
    C ← E^len_k(F, n + 1, B)
    F.WriteBlock(n + 1, C)

F.Delete():
    n ← F.blocks
    MT_F.DeleteTree(TS_F)
    delete B_n from file F

Figure 2: The Update, Check, Append and Delete algorithms for the MT-EINT construction.


5.1 The Merkle Tree Construction
MT-EINT


In this construction, file blocks can be encrypted with
any length-preserving stateful encryption scheme and
they are authenticated with a Merkle tree. More precisely,
if F is a file comprised of blocks B_1, ..., B_n, then the
untrusted storage for integrity for file F is
US_F = MT_F(h(1||B_1), ..., h(n||B_n)) (for h a
second-preimage resistant hash function), and the trusted
storage TS_F is the root of this tree.

The algorithm Init runs the key generation algorithm
G^len of the length-preserving stateful encryption scheme
for file F. The algorithms Update, Check, Append and
Delete of the MT-EINT construction are given in Figure 2.
We denote here by F.enc_key the encryption key for file F
(generated in the Init algorithm) and by F.blocks the
number of blocks in file F.

- In the Update(i, B) algorithm for file F, the i-th leaf
in MT_F is updated with the hash of the new block content
using the algorithm UpdateTree and the encryption of B is
stored in the i-th block of F.

- To append a new block B to file F with algorithm
Append(B), a new leaf is appended to MT_F with the
algorithm AppendTree, and then an encryption of B is
stored in the (n + 1)-th block of F (for n the number of
blocks of F).

- In the Check(i, C) algorithm for file F, block C
is decrypted, and its integrity is checked using the
CheckTree algorithm.

- To delete the last block from a file F with algorithm
Delete, the last leaf in MT_F is deleted with the algorithm
DeleteTree.

The MT-EINT construction detects data modification
and block swapping attacks, as file block contents are
authenticated by the root of the Merkle tree for each file.
The MT-EINT construction is also secure against replay
attacks, as the tree contains the hashes of the latest
version of the data blocks and the root of the Merkle tree
is authenticated in trusted storage.
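
The detection argument above can be illustrated with a small sketch.
It recomputes the whole tree over the leaves h(i||B_i) instead of a
single leaf-to-root path, and the tree shape (an unpaired node is
carried up unchanged) and all function names are assumptions of the
example rather than details taken from our implementation.

```python
import hashlib


def leaf_hash(index: int, block: bytes) -> bytes:
    """h(i || B): binding the index to the content defeats block swapping."""
    return hashlib.sha1(index.to_bytes(8, "big") + block).digest()


def merkle_root(leaves: list) -> bytes:
    """Fold a list of leaf hashes into a single root hash."""
    if not leaves:
        return hashlib.sha1(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        # Hash adjacent pairs; an unpaired last node moves up unchanged.
        nxt = [hashlib.sha1(level[k] + level[k + 1]).digest()
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def file_root(blocks: list) -> bytes:
    # Blocks are numbered from 1, as in the text.
    return merkle_root([leaf_hash(i, b) for i, b in enumerate(blocks, start=1)])


blocks = [b"alpha", b"beta", b"gamma"]
trusted_root = file_root(blocks)             # kept in trusted storage TS_F

swapped = [blocks[1], blocks[0], blocks[2]]  # server swaps blocks 1 and 2
stale = [b"alpha", b"old-beta", b"gamma"]    # server replays an old block 2
```

Both the swapped and the stale file hash to a root that differs from
the trusted one, so a client holding only the root rejects them.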


5.2 The Randomness Test Construction 
RAND-EINT 


Whereas in the Merkle tree construction any length- 
preserving stateful encryption algorithm can be used to 
individually encrypt blocks in a file, the randomness test 
construction uses the observation from Oprea et al. [26] 
that the integrity of the blocks that are efficiently dis- 
tinguishable from random blocks can be checked with a 
randomness test if a tweakable cipher is used to encrypt 
them. As such, integrity information is stored only for 
random-looking blocks. 
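
To make the notion of a randomness test concrete, the sketch below
classifies a block by the Shannon entropy of its byte histogram. This
is only one plausible instantiation: the threshold value and the names
byte_entropy and is_rand are hypothetical, and the test of [26] may
differ in its details.

```python
import math
import os
from collections import Counter

ENTROPY_THRESHOLD_BITS = 7.0  # hypothetical cut-off, not taken from the paper


def byte_entropy(block: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    if not block:
        return 0.0
    total = len(block)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(block).values())


def is_rand(block: bytes) -> bool:
    """Return True if the block looks random under the entropy estimate."""
    return byte_entropy(block) >= ENTROPY_THRESHOLD_BITS


text_block = (b"the quick brown fox jumps over the lazy dog\n" * 100)[:4096]
random_block = os.urandom(4096)
```

Under these assumptions a text block falls well below the threshold and
needs no hash in the tree, while a block of uniformly random bytes is
classified as random-looking.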

In this construction, a Merkle tree per file that
authenticates the contents of the random-looking blocks is
built. The untrusted storage for integrity US_F for file F
comprised of blocks B_1, ..., B_n includes this tree
RTree_F = MT(h(i||B_i) : i ∈ {1,...,n} and IsRand(B_i) = 1),
and, in addition, the set of block numbers that are
random-looking RArr_F = {i ∈ {1,...,n} : IsRand(B_i) = 1},
ordered the same as the leaves in the previous tree
RTree_F. The root of the tree RTree_F is kept in the
trusted storage TS_F for file F.

To protect against replay attacks, clients need to
distinguish different writes of the same block in a file. A
simple idea [26] is to use a counter per file block that
denotes the number of writes of that block, and make the
counter part of the encryption tweak. The block write
counters need to be authenticated in the trusted storage
space for the file F to prevent clients from accepting
valid older versions of a block that are considered not
random by the randomness test. To ensure that file blocks
are encrypted with different tweaks, we define the tweak
for a file block to be a function of the file, the block
index and the block write counter. We denote by F.Tweak the
tweak-generating function for file F that takes as input
a block index and a block counter and outputs the tweak
for that file block. The properties of tweakable ciphers
imply that if a block is decrypted with a different counter
(and so a different tweak), then it will look random with
high probability.

F.Update(i, B):
    k ← F.enc_key
    F.UpdateCtr(i)
    F.AuthCtr()
    if IsRand(B) = 0
        if i ∈ RArr_F
            RTree_F.DelOffsetTree(TS_F, RArr_F, i)
    else
        if i ∈ RArr_F
            j ← RArr_F.SearchOffset(i)
            RTree_F.UpdateTree(TS_F, j, h(i||B))
        else
            RTree_F.AppendTree(TS_F, h(i||B))
            append i at end of RArr_F
    F.WriteBlock(i, E_k^{F.Tweak(i, F.GetCtr(i))}(B))

F.Check(i, C):
    k ← F.enc_key
    if F.CheckCtr() = false
        return ⊥
    B_i ← D_k^{F.Tweak(i, F.GetCtr(i))}(C)
    if IsRand(B_i) = 0
        return B_i
    else
        if i ∈ RArr_F
            j ← RArr_F.SearchOffset(i)
            if RTree_F.CheckTree(TS_F, j, h(i||B_i)) = true
                return B_i
            else
                return ⊥
        else
            return ⊥

F.Append(B):
    k ← F.enc_key
    n ← F.blocks
    F.UpdateCtr(n + 1)
    F.AuthCtr()
    if IsRand(B) = 1
        RTree_F.AppendTree(TS_F, h(n + 1||B))
        append n + 1 at end of RArr_F
    F.WriteBlock(n + 1, E_k^{F.Tweak(n + 1, F.GetCtr(n + 1))}(B))

F.Delete():
    n ← F.blocks
    if n ∈ RArr_F
        RTree_F.DelOffsetTree(TS_F, RArr_F, n)
    delete B_n from file F

Figure 3: The Update, Check, Append and Delete algorithms for the RAND-EINT construction.

The algorithm Init selects a key at random from
the key space of the tweakable encryption scheme E.
The Update, Check, Append and Delete algorithms of
RAND-EINT are detailed in Figure 3. For the array
RArr_F, RArr_F.items denotes the number of items in the
array, RArr_F.last denotes the last element in the array,
and the function SearchOffset(i) for the array RArr_F
gives the position in the array where index i is stored
(if it exists in the array).

- In the Update(i, B) algorithm for file F, the write
counter for block i is incremented and the counter
authentication information from TS_F is updated with the
algorithm AuthCtr. Then, the randomness test IsRand
is applied to block content B. If B is not random-looking,
then the leaf corresponding to block i (if it exists)
has to be removed from RTree_F. This is done with the
algorithm DelOffsetTree, described in Figure 4. On the
other hand, if B is random-looking, then the leaf
corresponding to block i has to be either updated with the
new hash (if it exists in the tree) or appended in RTree_F.
Finally, the tweakable encryption of B is stored in the
i-th block of F.

- To append a new block B to file F with n blocks
using the Append(B) algorithm, the counter for block
n + 1 is updated first with algorithm UpdateCtr. The
counter authentication information from trusted storage
is also updated with algorithm AuthCtr. Furthermore,
the hash of the block index concatenated with the block
content is added to RTree_F only if the block is
random-looking. In addition, index n + 1 is added to RArr_F
in this case. Finally, the tweakable encryption of B is
stored in the (n + 1)-th block of F.

- In the Check(i, C) algorithm for file F, the
authentication information for the block counters is
checked first. Then block C is decrypted, and checked for
integrity. If the content of the i-th block is not
random-looking, then by the properties of tweakable ciphers
we can infer that the block is valid with high probability.
Otherwise, the integrity of the i-th block is checked using
the tree RTree_F. If i is not a block index in the tree,
then the integrity of block i is unconfirmed and the block
is rejected.

- In the Delete algorithm for file F, the hash of the last
block has to be removed from the tree by calling the
algorithm DelOffsetTree (described in Figure 4), in the
case in which the last block is authenticated through
RTree_F.

It is not necessary to authenticate in trusted storage the
array RArr_F of indices of the random-looking blocks in
a file. The reason is that the root of RTree_F is
authenticated in trusted storage and this implies that an
adversary cannot modify the order of the leaves in RTree_F
without being detected in the AppendTree, UpdateTree or
CheckTree algorithms.

T.DelOffsetTree(TS_F, RArr_F, i):
    j ← RArr_F.SearchOffset(i)
    l ← RArr_F.last
    if j ≠ l
        T.UpdateTree(TS_F, j, h(l||B_l))
        RArr_F[j] ← l
    RArr_F.items ← RArr_F.items − 1
    T.DeleteTree(TS_F)

Figure 4: The DelOffsetTree algorithm for a tree T
deletes the hash of block i from T and moves the last
leaf to its position, if necessary.
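
The move-last-leaf deletion performed by DelOffsetTree can be sketched
on plain lists. In this simplified example a Python list of leaf
hashes stands in for the tree, the dictionary blocks stands in for the
file contents, and the sketch compares array positions to decide
whether block i already owns the last leaf; these choices and the
function names are assumptions made for illustration.

```python
import hashlib


def leaf_hash(index: int, block: bytes) -> bytes:
    """h(i || B) for the leaf that authenticates block i."""
    return hashlib.sha1(index.to_bytes(8, "big") + block).digest()


def del_offset(leaves: list, rarr: list, i: int, blocks: dict) -> None:
    """Remove block i's leaf: overwrite it with the last leaf, then truncate."""
    j = rarr.index(i)          # SearchOffset(i): position of block i in RArr
    last = rarr[-1]            # RArr.last: block index that owns the last leaf
    if j != len(rarr) - 1:     # block i does not already own the last leaf
        leaves[j] = leaf_hash(last, blocks[last])  # UpdateTree at position j
        rarr[j] = last
    rarr.pop()                 # RArr.items decreases by one
    leaves.pop()               # DeleteTree drops the last leaf


blocks = {2: b"x", 5: b"y", 9: b"z"}       # random-looking blocks of a file
rarr = [2, 5, 9]                           # leaf order in the tree
leaves = [leaf_hash(i, blocks[i]) for i in rarr]

del_offset(leaves, rarr, 2, blocks)        # block 2 is no longer random-looking
print(rarr)  # [9, 5]
```

Because the last leaf fills the vacated position, the deletion touches
one leaf and removes one, instead of shifting every later leaf.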

The construction RAND-EINT protects against unauthorized
modification of data written to disk and block swapping
attacks by authenticating the root of RTree_F in the
trusted storage space for each file. By using write
counters in the encryption of block contents and
authenticating the values of the counters in trusted
storage, this construction provides defense against replay
attacks and provides all the security properties of the
MT-EINT construction.


5.3 The Compression Construction
COMP-EINT


This construction is again based on the intuition that 
many workloads feature redundancy in file contents. In 
this construction, the block is compressed before encryp- 
tion. If the compression level of the block content is high 
enough, then a message authentication code (i.e., MAC) 
of the block can be stored in the block itself, reducing the 
amount of storage necessary for integrity. The authenti- 
cation information for blocks that can be compressed is 
stored on untrusted storage, and consequently a MAC is 
required. As in the previous construction, a Merkle tree
RTree_F is built over the hashes of the blocks in file F
that cannot be compressed enough, and the root of the tree
is kept in trusted storage. In order to prevent replay
attacks, it is necessary that block write counters are
included either in the computation of the block MAC (in the
case in which the block can be compressed) or in hashing
the block (in the case in which the block cannot be
compressed enough). Similarly to scheme RAND-EINT, the
write counters for a file F need to be authenticated in the
trusted storage space TS_F.

In this construction, file blocks can be encrypted
with any length-preserving encryption scheme, as defined
in Section 4.2. In describing the scheme, we need
compression and decompression algorithms such that
decompress(compress(m)) = m, for any message m.
We can also pad messages up to a certain fixed length
by using the pad function with an output of l bytes, and
unpad a padded message with the unpad function such
that unpad(pad(m)) = m, for all messages m of length
less than l bytes. We can use standard padding methods
for implementing these algorithms [4]. To authenticate
blocks that can be compressed, we use a message
authentication code H : K_H × {0,1}* → {0,1}^s that outputs
strings of length s bits.

The algorithm Init runs the key generation algorithm
G^len of the length-preserving stateful encryption scheme
for file F to generate key k1 and selects at random a key
k2 from the key space K_H of H. It outputs the tuple
<k1, k2>. The Update, Append, Check and Delete algorithms
of the COMP-EINT construction are detailed in Figure 5.
Here L_c is the byte length of the largest plaintext size
for which the ciphertext is of length at most the file
block length less the size of a MAC function output.
For example, if the block size is 4096 bytes, HMAC [3]
with SHA-1 is used for computing MACs (whose output
is 20 bytes) and AES with 16-byte blocks is used for
encryption, then L_c is the largest multiple of the AES
block size (i.e., 16 bytes) less than 4096 − 20 = 4076
bytes. The value of L_c in this case is 4064 bytes.
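
The arithmetic of this example can be checked with a few lines of
code. The function name and parameter names below are hypothetical;
the computation simply rounds the space left after the MAC down to a
multiple of the cipher block size.

```python
def max_compressed_length(block_bytes: int, mac_bytes: int,
                          cipher_block_bytes: int) -> int:
    """Largest multiple of the cipher block size that leaves room for the MAC."""
    available = block_bytes - mac_bytes  # 4096 - 20 = 4076 in the example
    return (available // cipher_block_bytes) * cipher_block_bytes


l_c = max_compressed_length(block_bytes=4096, mac_bytes=20,
                            cipher_block_bytes=16)
print(l_c)  # 4064
```

Here 4076 is not a multiple of 16: 254 cipher blocks occupy 4064 bytes,
whereas 255 blocks would need 4080 bytes and leave no room for the MAC.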

- In the Update(i, B) algorithm for file F, the write
counter for block i is incremented and the counter
authentication information from TS_F is updated with the
algorithm AuthCtr. Then block content B is compressed
to B^c. If the length of B^c (denoted |B^c|) is at most
L_c, then there is room to store the MAC of the block
content inside the block. In this case, the hash of the
previous block content stored at the same address is
deleted from the Merkle tree RTree_F, if necessary. The
compressed block is padded and encrypted, and then stored
with its MAC in the i-th block of F. Otherwise, if the
block cannot be compressed enough, then its hash has to be
inserted into the Merkle tree RTree_F. The block content
B is then encrypted with a length-preserving stateful
encryption scheme using the key for the file and is stored
in the i-th block of F.

- To append a new block B to file F with n blocks using
the Append(B) algorithm, the counter for block n + 1
is updated first with algorithm UpdateCtr. The counter
authentication information from trusted storage is also
updated with algorithm AuthCtr. Block B is then
compressed. If it has an adequate compression level, then
the compressed block is padded and encrypted, and a MAC
is concatenated at the end of the new block. Otherwise, a
new hash is appended to the Merkle tree RTree_F and an
encryption of B is stored in the (n + 1)-th block of F.

F.Update(i, B):
    <k1, k2> ← F.enc_key
    F.UpdateCtr(i)
    F.AuthCtr()
    B^c ← compress(B)
    if |B^c| ≤ L_c
        if i ∈ RArr_F
            RTree_F.DelOffsetTree(TS_F, RArr_F, i)
        C ← E^len_k1(F, i, pad(B^c))
        F.WriteBlock(i, C || H_k2(i||F.GetCtr(i)||B))
    else
        if i ∈ RArr_F
            j ← RArr_F.SearchOffset(i)
            RTree_F.UpdateTree(TS_F, j, h(i||F.GetCtr(i)||B))
        else
            RTree_F.AppendTree(TS_F, h(i||F.GetCtr(i)||B))
            append i at end of RArr_F
        C ← E^len_k1(F, i, B)
        F.WriteBlock(i, C)

F.Check(i, C):
    <k1, k2> ← F.enc_key
    if F.CheckCtr() = false
        return ⊥
    if i ∈ RArr_F
        B_i ← D^len_k1(F, i, C)
        j ← RArr_F.SearchOffset(i)
        if RTree_F.CheckTree(TS_F, j, h(i||F.GetCtr(i)||B_i)) = true
            return B_i
        else
            return ⊥
    else
        parse C as C'||hval
        B_i^c ← unpad(D^len_k1(F, i, C'))
        B_i ← decompress(B_i^c)
        if hval = H_k2(i||F.GetCtr(i)||B_i)
            return B_i
        else
            return ⊥

F.Append(B):
    <k1, k2> ← F.enc_key
    n ← F.blocks
    F.UpdateCtr(n + 1)
    F.AuthCtr()
    B^c ← compress(B)
    if |B^c| ≤ L_c
        C ← E^len_k1(F, n + 1, pad(B^c))
        F.WriteBlock(n + 1, C || H_k2(n + 1||F.GetCtr(n + 1)||B))
    else
        RTree_F.AppendTree(TS_F, h(n + 1||F.GetCtr(n + 1)||B))
        append n + 1 at end of RArr_F
        C ← E^len_k1(F, n + 1, B)
        F.WriteBlock(n + 1, C)

F.Delete():
    n ← F.blocks
    if n ∈ RArr_F
        RTree_F.DelOffsetTree(TS_F, RArr_F, n)
    delete B_n from file F

Figure 5: The Update, Check, Append and Delete algorithms for the COMP-EINT construction.


- In the Check(i, C) algorithm for file F, the
authentication information from TS_F for the block counters
is checked first. There are two cases to consider. First,
if the i-th block of F is authenticated through the Merkle
tree RTree_F, as indicated by RArr_F, then the block is
decrypted and algorithm CheckTree is called. Otherwise,
the MAC of the block content is stored at the end of
the block and we can thus parse the i-th block of F as
C'||hval. C' has to be decrypted, unpadded and
decompressed, in order to obtain the original block content
B_i. The value hval stored in the block is checked to match
the MAC of the block index i concatenated with the write
counter for block i and block content B_i.

- In the Delete algorithm for file F, the hash of the last
block has to be removed from the tree by calling the
algorithm DelOffsetTree (described in Figure 4), in the
case in which the last block is authenticated through
RTree_F.

The construction COMP-EINT protects against replay
attacks by using write counters for either computing a MAC
over the contents of blocks that can be compressed enough,
or a hash over the contents of blocks that cannot be
compressed enough, and authenticating the write counters in
trusted storage. It meets all the security properties of
MT-EINT and RAND-EINT.
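
The compress-then-MAC path of COMP-EINT can be sketched as follows.
Encryption is deliberately omitted, so this is not a secure scheme as
written; the padding rule (a 0x80 marker followed by zero bytes), the
8-byte encodings of index and counter, and the names pack and check
are assumptions of the example. Because the marker needs one byte, the
sketch requires the compressed block to be strictly shorter than L_c.

```python
import hashlib
import hmac
import os
import zlib
from typing import Optional

MAC_BYTES = 20   # HMAC-SHA1 output length
L_C = 4064       # value derived in the text for 4096-byte blocks


def pad(data: bytes) -> bytes:
    """Pad to exactly L_C bytes: a 0x80 marker followed by zero bytes."""
    return data + b"\x80" + b"\x00" * (L_C - len(data) - 1)


def unpad(data: bytes) -> bytes:
    """Strip the padding; the last 0x80 byte is always the marker."""
    return data[:data.rindex(b"\x80")]


def mac(key: bytes, index: int, counter: int, block: bytes) -> bytes:
    """H_k(i || ctr || B), instantiated with HMAC-SHA1."""
    header = index.to_bytes(8, "big") + counter.to_bytes(8, "big")
    return hmac.new(key, header + block, hashlib.sha1).digest()


def pack(key: bytes, index: int, counter: int, block: bytes) -> Optional[bytes]:
    """Return the stored block, or None if its hash must go into the tree."""
    compressed = zlib.compress(block)
    if len(compressed) >= L_C:       # no room for the padding marker
        return None
    # A real implementation would encrypt pad(compressed) at this point.
    return pad(compressed) + mac(key, index, counter, block)


def check(key: bytes, index: int, counter: int, stored: bytes) -> Optional[bytes]:
    """Return the block content if the embedded MAC verifies, else None."""
    body, tag = stored[:-MAC_BYTES], stored[-MAC_BYTES:]
    try:
        block = zlib.decompress(unpad(body))
    except (ValueError, zlib.error):
        return None
    if hmac.compare_digest(tag, mac(key, index, counter, block)):
        return block
    return None


key = b"k" * 20                                      # hypothetical MAC key
block = (b"log line 0001\n" * 300)[:4096]            # compressible content
stored = pack(key, index=3, counter=1, block=block)  # written with counter 1
```

Checking the stored block against the current counter value 1 returns
the content, whereas checking it against counter 2 (that is, replaying
it after a later write) fails, because the counter is covered by the
MAC.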


6 Implementation 


Our integrity algorithms are very general and they can 
be integrated into any cryptographic file system in either 
the kernel or user space. For the purpose of evaluat- 
ing and comparing their performance, we implemented 
them in EncFS [14], an open-source user-level file sys- 
tem that transparently encrypts file blocks. EncFS uses 
the FUSE [12] library to provide the file system interface. 
FUSE provides a simple library API for implementing 
file systems and it has been integrated into recent ver- 
sions of the Linux kernel. 

In EncFS, files are divided into fixed-size blocks 
and each block is encrypted individually. Several 
ciphers such as AES and Blowfish in CBC mode 
are available for block encryption. We implemented 
in EncFS the three constructions that provide in- 
tegrity: MT-EINT, RAND-EINT and COMP-EINT. 
While any length-preserving encryption scheme can be 
used in the MT-EINT and COMP-EINT constructions, 
RAND-EINT is constrained to use a tweakable cipher 
for encrypting file blocks. We choose to encrypt file
blocks in MT-EINT and COMP-EINT with the length-preserving
stateful encryption derived from the AES cipher in CBC mode
(as shown in Section 4.2), and use the CMC tweakable cipher
[15] as the encryption method in RAND-EINT. In our
integrity algorithms, we use the SHA-1 hash function and
the message authentication code HMAC, also instantiated
with the SHA-1 hash function. For compressing and
decompressing blocks in COMP-EINT we use the zlib library.

Figure 6: Prototype architecture.

Our prototype architecture is depicted in Figure 6. We 
modified the user space of EncFS to include the CMC 
cipher for block encryption and the new integrity al- 
gorithms. The server uses the underlying file system 
(i.e., reiserfs) for the storage of the encrypted files. The
Merkle trees for integrity RTree_F and the index arrays of
the random-looking blocks RArr_F are stored with the
encrypted files in the untrusted storage space on the
server. For faster integrity checking (in particular to
improve the running time of the SearchOffset algorithm used
in the Update and Check algorithms of the RAND-EINT
and COMP-EINT constructions), we also keep the array
RArr_F for each file, ordered by indices. The roots of the
trees RTree_F, and the arrays IntStart_F and CtrVal_F or
their hashes (if they are too large) are stored in a trusted
storage space. In our current implementation, we use
two extended attributes for each file F, one for the root
of RTree_F and the second for the arrays IntStart_F and
CtrVal_F, or their hashes.

By default, EncFS caches the last block content written
to or read from the disk. In our implementation,
we cache the last arrays RArr_F, IntStart_F and CtrVal_F
used in a block update or check operation. Since these
arrays are typically small (a few hundred bytes), they
easily fit into memory. We also evaluate the effect of
caching of Merkle trees in our system in Section 7.1.


7 Performance Evaluation 


In this section, we evaluate the performance of the new 
randomness test and compression integrity constructions 
for encrypted storage compared to that of Merkle trees. 
We ran our experiments on a 2.8 GHz Intel D processor 
machine with 1GB of RAM, running SuSE Linux 9.3 
with kernel version 2.6.11. The hard disk used was an
80GB SATA 7200 RPM Maxtor.

The main challenge we faced in evaluating the pro- 
posed constructions was to come up with representative 
file system workloads. While the performance of the 
Merkle tree construction is predictable independently of 
the workload, the performance of the new integrity algo- 
rithms is highly dependent on the file contents accessed, 
in particular on the randomness of block contents. To 
our knowledge, there are no public traces that contain file 
access patterns, as well as the contents of the file blocks 
read and written. Due to the privacy implications of re- 
leasing actual users’ data, we expect it to be nearly im- 
possible to get such traces from a widely used system. 
However, we have access to three public NFS Harvard 
traces [9] that contain NFS traffic from several of Har- 
vard’s campus servers. The traces were collected at the 
level of individual NFS operations and for each read and 
write operation they contain information about the file 
identifier, the accessed offset in the file and the size of 
the request (but not the actual file contents). 

To evaluate the integrity algorithms proposed in this 
paper, we perform two sets of experiments. In the first 
one, we strive to demonstrate how the performance of 
the new constructions varies for different file contents. 
For that, we use representative files from a Linux distri- 
bution installed on one of our desktop machines, together 
with other files from the user’s home directory, divided 
into several file types. We identify five file types of in- 
terest: text, object, executables, images, and compressed 
files, and build a set of files for each class of interest. 
All files of a particular type are first encrypted and the 
integrity information for them is built; then they are de- 
crypted and checked for integrity. We report the perfor- 
mance results for the files with the majority of blocks 
not random-looking (i.e., text, executable and object) and 
for those with mostly random-looking blocks (i.e., image 
and compressed). In this experiment, all files are written 
and read sequentially, and as such the access pattern is 
not a realistic one. 

In the second set of experiments, we evaluate the ef- 
fect of more realistic access patterns on the performance 
of the integrity schemes, using the NFS Harvard traces. 
As the Harvard traces do not contain information about 
actual file block contents written to the disks, we gen- 
erate synthetic block contents for each block write re- 
quest. We define two types of block contents: low- 
entropy and high-entropy, and perform experiments as- 
suming that either all blocks are low-entropy or all are 
high-entropy. These extreme workloads represent the
“best” and “worst” cases for the new algorithms,
respectively. We also consider a “middle” case, in which a
block is random-looking with a 50% probability, and plot
the performance results of the new schemes relative to
the Merkle tree integrity algorithm for the best, middle
and worst cases.
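
One way to generate such synthetic contents is sketched below. The
generator itself is hypothetical: a low-entropy block is modelled as a
single repeated byte value and a high-entropy block as pseudorandom
bytes, and the function name, the seed and the block count are
illustrative choices only.

```python
import random

BLOCK_BYTES = 4096


def synthetic_block(rng: random.Random, high_entropy_probability: float) -> bytes:
    """Return a high-entropy block with the given probability, else a low-entropy one."""
    if rng.random() < high_entropy_probability:
        return bytes(rng.getrandbits(8) for _ in range(BLOCK_BYTES))
    return bytes([rng.getrandbits(8)]) * BLOCK_BYTES  # one byte value repeated


rng = random.Random(2007)  # fixed seed so that a run can be repeated
best_case = [synthetic_block(rng, 0.0) for _ in range(4)]    # all low-entropy
worst_case = [synthetic_block(rng, 1.0) for _ in range(4)]   # all high-entropy
middle_case = [synthetic_block(rng, 0.5) for _ in range(4)]  # 50% mix
```

A probability of 0.0 or 1.0 reproduces the two extreme workloads, and
0.5 gives the middle case.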


7.1 The Impact of File Block Contents on
Integrity Performance

File sets. We consider a snapshot of the file system 
from one of our desktop machines. We gathered files 
that belong to five classes of interest: (1) text files are 
files with extensions .txt, .tex, .c, .h, .cpp, .java, .ps, .pdf; 
(2) object files are system library files from the direc- 
tory /usr/local/lib; (3) executable files are system exe- 
cutable files from directory /usr/local/bin; (4) image files 
are JPEG files and (5) compressed files are gzipped tar 
archives. Several characteristics of each set, including 
the total size, the number of files in each set, the mini- 
mum, average and maximum file sizes and the fraction 
of file blocks that are considered random-looking by the 
entropy test are given in Table 2. 


Experiments. We consider three cryptographic file 
systems: (1) MT-EINT with CBC-AES for encrypting 
file blocks; (2) RAND-EINT with CMC encryption; (3) 
COMP-EINT with CBC-AES encryption. For each cryp- 
tographic file system, we first write the files from each 
set; this has the effect of automatically encrypting the 
files, and running the Update algorithm of the integrity 
method for each file block. Second, we read all files from 
each set; this has the effect of automatically decrypting 
the files, and running the Check algorithm of the integrity 
method for each file block. We use file blocks of size 
4KB in the experiments. 


Micro-benchmarks. We first present a micro-benchmark
evaluation for the text and compressed file
sets in Figure 7. We plot the total time to write and 
read the set of text and compressed files, respectively. 
The write time for a set of files includes the time to 
encrypt all the files in the set, create new files, write the 
encrypted contents in the new files and build the integrity 
information for each file block with algorithms Update 
and Append. The read time for a set of files includes the 
time to retrieve the encrypted files from disk, decrypt 
each file from the set and check the integrity of each 
file block with algorithm Check. We separate the total 
time incurred by the write and read experiments into 
the following components: encryption/decryption time 
(either AES or CMC); hashing time that includes the 
computation of both SHA-1 and HMAC; randomness 
check time (either the entropy test for RAND-EINT or 
compression/decompression time for COMP-EINT); 
Merkle tree operations (e.g., given a leaf index, find its
index in inorder traversal or given an inorder index of
a node in the tree, find the inorder index of its sibling
and parent); the time to update and check the root of the
tree (the root of the Merkle tree is stored as an extended
attribute for the file); and disk waiting time.

The results show that the cost of CMC encryption and 
decryption is about 2.5 times higher than that of AES en- 
cryption and decryption in CBC mode. Decompression 
is between 4 and 6 times faster than compression and this 
accounts for the good read performance of COMP-EINT. 

A substantial amount of the MT-EINT overhead is due 
to disk waiting time (for instance, 39% at read for text 
files) and the time to update and check the root of the 
Merkle tree (for instance, 30% at write for compressed 
files). In contrast, due to smaller sizes of the Merkle 
trees in the RAND-EINT and COMP-EINT file systems, 
the disk waiting time and the time to update and check 
the root of the tree for text files are smaller. The results
suggest that caching of the hash values stored in Merkle
trees in the file system might reduce the disk waiting time 
and the time to update the root of the tree and improve 
the performance of all three integrity constructions, and 
specifically that of the MT-EINT algorithm. We present 
our results on caching next. 


Caching Merkle trees. We implemented a global 
cache that stores the latest hashes read from Merkle trees 
used to either update or check the integrity of file blocks. 
As an optimization, when we verify the integrity of a file 
block, we compute all the hashes on the path from the 
node up to the root of the tree until we reach a node that 
is already in the cache and whose integrity has been val- 
idated. We store in the cache only nodes that have been 
verified and that are authentic. When a node in the cache 
is written, all its ancestors on the path from the node to 
the root, including the node itself, are evicted from the 
cache. 
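
The caching policy just described can be sketched on a small tree.
The heap layout of the tree, the unbounded dictionary used as cache
and all function names are assumptions of this example; our cache has
a bounded size, which the sketch does not model.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def build_tree(leaf_hashes: list) -> list:
    """Heap layout: node 1 is the root, node k has children 2k and 2k+1."""
    n = len(leaf_hashes)              # assumed to be a power of two
    nodes = [b""] * (2 * n)
    nodes[n:] = leaf_hashes
    for k in range(n - 1, 0, -1):
        nodes[k] = H(nodes[2 * k] + nodes[2 * k + 1])
    return nodes


def verify_leaf(nodes, trusted_root, cache, leaf_pos, leaf_hash) -> bool:
    """Hash upwards until a cached (already verified) node or the root is reached."""
    n = len(nodes) // 2
    k, computed, seen = n + leaf_pos, leaf_hash, {}
    while True:
        if k in cache:                # verified earlier: stop climbing here
            ok = cache[k] == computed
            break
        if k == 1:                    # reached the root: compare with TS_F
            ok = computed == trusted_root
            break
        sibling = nodes[k ^ 1]        # read from untrusted storage
        seen[k], seen[k ^ 1] = computed, sibling
        pair = computed + sibling if k % 2 == 0 else sibling + computed
        computed = H(pair)
        k //= 2
    if ok:
        cache.update(seen)            # cache only nodes shown to be authentic
    return ok


def evict_path(cache, n_leaves, leaf_pos) -> None:
    """On a write, evict the node and all its ancestors up to the root."""
    k = n_leaves + leaf_pos
    while k >= 1:
        cache.pop(k, None)
        k //= 2


leaves = [H(bytes([i])) for i in range(8)]
nodes = build_tree(leaves)
trusted_root = nodes[1]
cache = {}
first = verify_leaf(nodes, trusted_root, cache, 3, leaves[3])   # climbs to root
second = verify_leaf(nodes, trusted_root, cache, 2, leaves[2])  # cache hit at leaf
```

After the first check, the sibling leaf is already in the cache, so
the second check stops immediately instead of hashing up to the root.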

We plot the total file write and read time in seconds 
for the three cryptographic file systems as a function of 
different cache sizes. We also plot the average integrity 
bandwidth per block in a log-log scale. Finally, we plot 
the cumulative size of the untrusted storage US_F for all
files from each set. We show the combined graphs for 
low-entropy files (text, object and executable files) in 
Figure 8 and for high-entropy files (compressed and im- 
age files) in Figure 9. 

The results show that MT-EINT benefits mostly on
reads by implementing a cache of size 1KB, while the
write time is not affected greatly by using a cache. The
improvements for MT-EINT using a cache of 1KB are
as much as 25.22% for low-entropy files and 20.34% for
high-entropy files in the read experiment. In the
following, we compare the performance of the three
constructions for the case in which a 1KB cache is used.

            | Total size | No. files   | Min. file size | Max. file size | Avg. file size | Fraction of random-looking blocks
Text        | 245 MB     | 808         | 27 bytes       | 34.94 MB       | 307.11 KB      | 0.0351
Objects     | 217 MB     | 28          | 15 bytes       | 92.66 MB       | 7.71 MB        | 0.0001
Executables | 341 MB     | 3029        | 24 bytes       | 13.21 MB       | 112.84 KB      | 0.0009
Image       | 189 MB     | [illegible] | 17 bytes       | 2.24 MB        | 198.4 KB       | 0.502
Compressed  | 249 MB     | 2           | 80.44 MB       | 167.65 MB      | 124.05 MB      | 0.7812

Table 2: File set characteristics.

Figure 7: Micro-benchmarks for text and compressed files.


Results for low-entropy files. For sets of files with a
low percentage of random-looking blocks (text, object and
executable files), RAND-EINT outperforms MT-EINT 
with respect to all the metrics considered. The perfor- 
mance of RAND-EINT compared to that of MT-EINT 
is improved by 31.77% for writes and 20.63% for reads. 
The performance of the COMP-EINT file system is very 
different in the write and read experiments due to the 
cost difference of compression and decompression. The 
write time of COMP-EINT is within 4% of the write time 
of MT-EINT and in the read experiment COMP-EINT 
outperforms MT-EINT by 25.27%. The integrity band- 
width of RAND-EINT and COMP-EINT is 92.93 and 
58.25 times, respectively, lower than that of MT-EINT. 
The untrusted storage for integrity for RAND-EINT and 
COMP-EINT is reduced 2.3 and 1.17 times, respectively, 
compared to MT-EINT. 


Results for high-entropy files. For sets of files with 
a high percent of random-looking blocks (image and 
compressed files), RAND-EINT adds a maximum per- 
formance overhead of 4.43% for writes and 18.15% 
for reads compared to MT-EINT for a 1KB cache. 
COMP-EINT adds a write performance overhead of 
38.39% compared to MT-EINT, and performs within 
1% of MT-EINT in the read experiment. The aver- 
age integrity bandwidth needed by RAND-EINT and 
COMP-EINT is lower by 30.15% and 10.22%, respec- 
tively, than that used by MT-EINT. The untrusted stor- 
age for integrity used by RAND-EINT is improved by 
9.52% compared to MT-EINT and that of COMP-EINT 
is within 1% of the storage used by MT-EINT. The rea- 
son that the average integrity bandwidth and untrusted 
storage for integrity are still reduced in RAND-EINT 
compared to MT-EINT is that in the set of high-entropy 
files considered only about 70% of the blocks have high 
entropy. We would expect that for files with 100% 
high-entropy blocks, these two metrics will exhibit a small 
overhead with both RAND-EINT and COMP-EINT 
compared to MT-EINT (this is actually confirmed in 
the experiments from the next section). However, such 
workloads with 100% high-entropy files are very unlikely 
to occur in practice. 

Figure 8: Evaluation for low-entropy files (text, object and executable files). 


7.2 The Impact of File Access Patterns on Integrity Performance 

File traces. We considered a subset of the three NFS 
Harvard traces [9] (LAIR, DEASNA and HOME02), 
each collected during one day. We show several charac- 
teristics of each trace, including the number of files and 
the total number of block write and read operations, in 
Table 3. The block size in these traces is 4096 bytes and 
we have implemented a 1KB cache for Merkle trees. 





Trace | Number of files | Number of writes | Number of reads 
LAIR | 7017 | 66331 | 23281 
DEASNA | 890 | 64091 | 521 
HOME02 | 183 | 89425 | 11815 

Table 3: NFS Harvard trace characteristics. 


Experiments. We replayed each of the three traces 
with three types of block contents: all low-entropy, all 
high-entropy and 50% high-entropy. For each experiment, 
we measured the total running time, the average 
integrity bandwidth and the total untrusted storage for 
integrity for RAND-EINT and COMP-EINT relative to 
MT-EINT, and plot the results in Figure 10. In these 
graphs, the horizontal axis represents the performance of 
MT-EINT, and the performance of RAND-EINT and 
COMP-EINT is shown relative to it. Points above the 
horizontal axis are overheads compared to MT-EINT, 
and points below it are improvements relative to 
MT-EINT. The labels on the graphs denote the percentage 
of random-looking blocks synthetically generated. 
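
The relative measure plotted in Figure 10 can be illustrated with a short sketch. The text does not spell out the normalization, so the formula below (percent difference with respect to the MT-EINT value) is an assumption, and the function name and the sample numbers are hypothetical.

```python
def relative_change(scheme_value: float, baseline_value: float) -> float:
    """Percent difference of a scheme relative to a baseline (MT-EINT here).

    Assumed convention: a positive result is an overhead (a point above
    the horizontal axis), a negative result is an improvement (below it).
    """
    if baseline_value <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (scheme_value - baseline_value) / baseline_value


# Hypothetical running times in seconds, invented purely for illustration.
print(relative_change(12.0, 10.0))  # slower than the baseline: overhead
print(relative_change(8.0, 10.0))   # faster than the baseline: improvement
```

The same function applies unchanged to the other two metrics (average integrity bandwidth and untrusted storage for integrity), since each is reported relative to MT-EINT.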


Results. The performance improvements of 
RAND-EINT and COMP-EINT compared to MT-EINT 
are as high as 56.21% and 56.85%, respectively, for the 
HOME02 trace for low-entropy blocks. On the other 
hand, the performance overhead for high-entropy blocks 
is at most 54.14% for RAND-EINT (in the LAIR trace) 
and 61.48% for COMP-EINT (in the DEASNA trace). 
RAND-EINT performs better than COMP-EINT when 
the ratio of read to write operations is small, as is the 
case for the DEASNA and HOME02 traces. As this ratio 
increases, COMP-EINT outperforms RAND-EINT. 

For low-entropy files, both the average integrity band- 
width and the untrusted storage for integrity for both 
RAND-EINT and COMP-EINT are greatly reduced 
compared to MT-EINT. For instance, in the DEASNA 
trace, MT-EINT needs 215 bytes on average to update 
or check the integrity of a block, whereas RAND-EINT 
and COMP-EINT only require on average 0.4 bytes. 
The amount of additional untrusted storage for integrity 
in the DEASNA trace is 2.56 MB for MT-EINT and 
only 7 KB for RAND-EINT and COMP-EINT. The 
maximum overhead added by both RAND-EINT and 
COMP-EINT compared to MT-EINT for high-entropy 
blocks is 30.76% for the average integrity bandwidth (in 
the HOME02 trace) and 19.14% for the amount of 
untrusted storage for integrity (in the DEASNA trace). 

Figure 9: Evaluation for high-entropy files (image and compressed files). 

Figure 10: Running time, average integrity bandwidth and storage for integrity of RAND-EINT and COMP-EINT 
relative to MT-EINT. Labels on the graphs represent percentage of random-looking blocks. 


7.3 Discussion 


From the evaluation of the three constructions, it follows 
that none of the schemes is a clear winner over the others 
with respect to all four metrics considered. Since the 
performance of both RAND-EINT and COMP-EINT is 
greatly affected by file block contents, it would be 
beneficial to know the percentage of high-entropy blocks in 
practical filesystem workloads. To determine statistics 
on file contents, we have performed a user study on 
several machines from our department running Linux. For 
each user machine, we have measured the percentage of 
high-entropy blocks and the percentage of blocks that 
cannot be compressed enough from users’ home directories. 
The results show that on average, 28% of file 
blocks have high entropy and 32% of file blocks 
cannot be compressed enough to fit a MAC inside. 


The implications of our study are that, for crypto- 
graphic file systems that store files similar to those in 
users’ home directories, the new integrity algorithms im- 
prove upon Merkle trees with respect to all four metrics 
of interest. In particular, COMP-EINT is the best op- 
tion for primarily read-only workloads when minimizing 
read latency is a priority, and RAND-EINT is the best 
choice for most other workloads. On the other hand, for 
an application in which the large majority of files have 
high entropy (e.g., a file sharing application in which 
users transfer mostly audio and video files), the standard 
MT-EINT still remains the best option for integrity. We 
recommend that all three constructions be implemented 
in a cryptographic file system. An application can choose 
the best scheme based on its typical workload. 

The new algorithms that we propose can be applied 
in other settings in which authentication of data stored 
on untrusted storage is desired. One example is check- 
ing the integrity of arbitrarily-large memory in a secure 
processor using only a constant amount of trusted stor- 
age [5, 7]. In this setting, a trusted checker maintains a 
constant amount of trusted storage and, possibly, a cache 
of data blocks most recently read from the main memory. 
The goal is for the checker to verify the integrity of the 
untrusted memory using a small bandwidth overhead. 

The algorithms described in this paper can be used 
only in applications where the data that needs to be 
authenticated is encrypted. However, the COMP-EINT 
integrity algorithm can be easily modified to fit into a 
setting in which data is only authenticated and not en- 
crypted, and can thus replace Merkle trees in such ap- 
plications. On the other hand, the RAND-EINT integrity 
algorithm is only suitable in a setting in which data is en- 
crypted with a tweakable cipher, as the integrity guaran- 
tees of this algorithm are based on the security properties 
of such ciphers. 


8 Related Work 


We have focused on Merkle trees as our point of compar- 
ison, though there are other integrity protections used on 
various cryptographic file systems that we have elided 
due to their greater expense in various measures. For 
example, a common integrity method used in crypto- 
graphic file systems such as TCFS [6] and SNAD [25] is 
to store a hash or message authentication code for each 
file block for authenticity. However, these approaches 
employ trusted storage linear in the number of blocks 
in the file (either the hashes or a counter per block). In 
systems such as SFS [22], SFSRO [11], Cepheus [10], 
FARSITE [1], Plutus [17], SUNDR [20] and IBM Stor- 
ageTank [23, 27], a Merkle tree per file is built and the 
root of the tree is authenticated (by either digitally sign- 
ing it or storing it in a trusted meta-data server). In SiR- 
iUS [13], each file is digitally signed for authenticity, and 
so in addition the integrity bandwidth to update or check 
a block in a file is linear in the file size. Tripwire [19] is a 
user-level tool that computes a hash per file and stores it 
in trusted storage. While this approach achieves constant 
trusted storage for integrity per file, the integrity 
bandwidth is linear in the number of blocks in the file. For 
journaling file systems, an elegant solution for integrity 
called hash logging is provided by PFS [32]. The hashes 
of file blocks together with the file system metadata are 
stored in the file system log, a protected memory area. 
However, in this solution the amount of trusted storage 
for integrity for a file is linear in the number of blocks in 
the file. 

Riedel et al. [28] provide a framework for extensively 
evaluating the security of storage systems. Wright et 
al. [33] evaluate the performance of five cryptographic 
file systems, focusing on the overhead of encryption. 
Two other recent surveys about securing storage systems 
are by Sivathanu et al. [31] and Kher and Kim [18]. 


9 Conclusion 


We have proposed two new integrity constructions, 
RAND-EINT and COMP-EINT, that authenticate file 
blocks in a cryptographic file system using only a con- 
stant amount of trusted storage per file. Our construc- 
tions exploit the typical low entropy of block contents 
and sequentiality of file block writes to reduce the addi- 
tional costs of integrity protection. We have evaluated 
the performance of the new constructions relative to the 
widely used Merkle tree algorithm, using files from a 
standard Linux distribution and NFS traces collected at 
Harvard University. Our experimental evaluation 
demonstrates that the performance of the new algorithms is 
greatly affected by file block contents and file access 
patterns. For workloads in which the majority of file 
blocks have low entropy, the new algorithms improve upon 
Merkle trees with respect to all four metrics considered. 


Abstract 


Application-level protocol specifications are useful for 
many security applications, including intrusion preven- 
tion and detection that performs deep packet inspection 
and traffic normalization, and penetration testing that 
generates network inputs to an application to uncover po- 
tential vulnerabilities. However, current practice in de- 
riving protocol specifications is mostly manual. In this 
paper, we present Discoverer, a tool for automatically 
reverse engineering the protocol message formats of an 
application from its network trace. A key property of 
Discoverer is that it operates in a protocol-independent 
fashion by inferring protocol idioms commonly seen in 
message formats of many application-level protocols. 
We evaluated the efficacy of Discoverer over one text 
protocol (HTTP) and two binary protocols (RPC and 
CIFS/SMB) by comparing our inferred formats with true 
formats obtained from Ethereal [5]. For all three proto- 
cols, more than 90% of our inferred formats correspond 
to exactly one true format; one true format is reflected 
in five inferred formats on average; our inferred formats 
cover over 95% of messages, which belong to 30-40% of 
true formats observed in the trace. 


1 Introduction 


Application-level protocol specifications are useful for 
many security applications. Penetration testing can lever- 
age protocol specifications to generate network inputs to 
an application to uncover potential vulnerabilities. For 
network management, protocol specifications can also be 
used to identify protocols and tunnelings in monitored 
network traffic. Generic protocol analyzers (GAPA [1] 
and binpac [16]) are important mechanisms for intrusion 
detection or firewall systems to perform deep packet in- 
spection. These analyzers take protocol specifications as 


input for their analyses. 

To date, protocol specifications for the above applica- 
tions are specified from documentation or reverse engi- 
neered manually. Such efforts are painstakingly time- 
consuming and error-prone. It took the open-source 
SAMBA project 12 years to manually reverse engineer 
the Microsoft SMB protocol [18]. In another example, 
the Yahoo messenger protocol has also been persistently 
reverse engineered; despite this, the open-source 
clients [6] regularly require patching to support 
proprietary changes in the Yahoo protocol. Sometimes, the 
period between the availability of an official client and an 
open-source client has been a month, with some open- 
source projects simply abandoning the effort due to the 
frequent changes initiated by Yahoo. 

To address this pain, we tackle the problem of auto- 
matic protocol reverse engineering. There can be two 
sources of given input for the reverse-engineering task: 
network traces and application code. In this paper, 
we present our tool, Discoverer, which performs au- 
tomatic reverse engineering from network traces. We 
leave application-code-based reverse engineering as fu- 
ture work. 

In Discoverer, we focus on reverse engineering the 
message format specification and leave the protocol state 
machine inference to our future work. To automatically 
reverse engineer message formats for a wide range of 
protocols, we face three main challenges: (1) We have 
very few hints from the network trace. The only evi- 
dent information from the trace is the directionality of 
byte streams. (2) Protocols are significantly different 
from each other. (3) Protocol message formats are often 
context-sensitive where earlier fields dictate the parsing 
of the subsequent part of the message. 

To make our tool general, we base our design on inferring 
protocol idioms commonly seen in message formats 
of many protocols. To cope with the few hints, we dissect 
the formless byte streams into text and binary segments 
or tokens as a starting point for clustering messages with 
similar patterns, where each cluster approximates a mes- 
sage format. By comparing messages in a cluster and 
observing the characteristics of known cross-field de- 
pendencies (such as a length field followed by a string 
of the length), we infer additional properties for the to- 
kens, which in turn can be leveraged to refine and divide 
the clusters of messages, where each subcluster approxi- 
mates a more precise format. This process continues re- 
cursively until we can no longer divide up any message 
clusters based on the newly finished inference. After this 
recursive clustering phase, we look at all message clus- 
ters globally through a type-based sequence alignment 
algorithm, and merge similar clusters into one. This way, 
we can produce more concise message formats. 


We have evaluated Discoverer over traces of a repre- 
sentative set of protocols consisting of one text protocol 
(HTTP) and two binary protocols (RPC and CIFS/SMB). 
We calibrated our design over some of these traces, and 
used the remaining traces for validation. The three main 
metrics for our tool are correctness (“does one inferred 
format correspond to exactly one true format?”), 
conciseness (“how many inferred formats is a single true format 
reflected in?”), and coverage (“how many messages are 
covered by the inferred formats?”). Across all protocols 
we tested, more than 90% of inferred formats correspond 
to exactly one true format; one true format is reflected 
in five inferred formats on average; our inferred formats 
cover over 95% of messages, which belong to 30-40% of 
true formats observed in the trace. This significant 
difference between message and format coverage is due to 
the heavy-tail distribution of message format popularity 
commonly seen in practice. 
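
The three metrics can be made concrete with a small sketch. It assumes that every message in the trace is labeled with its true format and with the inferred format that covers it, or `None` if no inferred format covers it. The input layout and the name `evaluate` are illustrative assumptions, not the evaluation code used for the results above.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Tuple


def evaluate(
    assignments: List[Tuple[Optional[Hashable], Hashable]]
) -> Dict[str, float]:
    """Compute correctness, conciseness and coverage.

    Each entry is (inferred_format_id or None, true_format_id) for one
    message. This input layout is an assumption made for illustration.
    """
    by_inferred = defaultdict(set)  # inferred format -> true formats seen
    by_true = defaultdict(set)      # true format -> inferred formats seen
    covered = 0
    for inferred, true in assignments:
        if inferred is None:
            continue  # message not covered by any inferred format
        covered += 1
        by_inferred[inferred].add(true)
        by_true[true].add(inferred)
    if not by_inferred:
        raise ValueError("no message is covered by an inferred format")
    return {
        # share of inferred formats that map to exactly one true format
        "correctness": sum(len(t) == 1 for t in by_inferred.values())
        / len(by_inferred),
        # average number of inferred formats per covered true format
        "conciseness": sum(len(i) for i in by_true.values()) / len(by_true),
        # share of messages covered by some inferred format
        "coverage": covered / len(assignments),
    }


sample = [("f1", "A"), ("f1", "A"), ("f2", "A"),
          ("f3", "B"), ("f3", "C"), (None, "D")]
print(evaluate(sample))
```

In the sample, inferred format `f3` mixes two true formats and therefore lowers correctness, while true format `A` is split over `f1` and `f2` and therefore raises the conciseness figure above one.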


Although our reverse-engineered message formats are 
imperfect, we expect them to still be practical for 
the aforementioned applications. For instance, penetration 
testing guided by our reverse-engineered formats is 
likely to be much more effective than that with random 
inputs. Protocol fingerprinting and tunneling detection 
probably do not require perfect protocol specifications. 
For applications like firewalls which would err with im- 
perfect specifications, our tool could still serve as a help 
to ease the manual protocol specification process. 


We organize the rest of the paper as follows. We dis- 
cuss common protocol idioms and the scope of Discov- 
erer in Section 2. We describe the design of Discoverer in 
detail in Section 3. We present our evaluation methodol- 
ogy and results in Section 4. We discuss related work in 
Section 5, and limitations and future work in Section 6. 
Finally, we summarize the paper in Section 7. 


2 Problem Statement 


Many application-level protocols share common pro- 
tocol idioms which correspond to the essential compo- 
nents in a protocol specification. To make our reverse- 
engineering algorithm applicable to many protocols, we 
base our design on inferring the common protocol id- 
ioms. In this section, we first describe these idioms and 
then explain the scope of Discoverer. 


2.1 Common Protocol Idioms 


Most application-level protocols involve the concept 
of an application session, which consists of a series of 
messages (also known as Application-level Data Units 
or ADUs) between two hosts that accomplishes a spe- 
cific task. The structure of an application session is de- 
termined by the application’s protocol state machine, an 
essential component in a protocol specification that char- 
acterizes all possible legitimate sequences of messages. 
The structure of an application message is determined by 
the application’s message format specification, another 
essential component in a protocol specification. A mes- 
sage format specifies a sequence of fields and their se- 
mantics. Common field semantics include length (re- 
flecting the size of a subsequent field with a variable 
length), offset (determining the byte offset of another 
field from a certain point like the start of the message), 
pointer (a special offset that specifies the index of a field 
in an array of arbitrary items), cookie (session-specific 
opaque data that appears in messages from both sides of 
the application session; session IDs are an example of 
cookie fields), endpoint-address (encoding IP addresses 
or port numbers in some form), and set (a group of fields 
that can be put in an arbitrary order). 

One particular type of field is the Format Distinguisher 
(FD) field. The value of this field serves to differentiate 
the format of the subsequent part of the message, 
which reflects the context-sensitive nature of the grammar 
of many application-level protocols. A message may 
have a sequence of FD fields, particularly when multiple 
protocols are encapsulated. For instance, a CIFS/SMB 
message consists of a NetBIOS header encapsulating an 
SMB header, which in turn may encapsulate an RPC 
message. This implies that applications need to scan a 
message from left to right, decoding a FD field before 
parsing the subsequent part of the message. 


2.2 Scope of Discoverer 


In this paper, we focus on deriving the message format 
specification and leave protocol state machine inference 
to our future work. We assume synchronous protocols to 
identify message boundaries. A message is a consecutive 
chunk of application-level data sent in one direction. 
It spans one or more packets in a TCP or UDP connection, 
where a UDP connection is a pair of unidirectional 
UDP flows that are matched on source/destination IP 
address/port number. We only aim to deal with applications 
that do not obfuscate their payloads. We do not aim to 
capture timing semantics (e.g., “message 1 usually 
follows message 2 within 10 seconds”). 

Figure 1: Overview of Discoverer’s architecture. In the example, we assume there is a single true message format 
which has two fields: the first binary field of a single byte represents the length of the second text field. There are two 
token patterns because, when the text field is shorter than a threshold, it is treated as binary. In the merging phase, this 
kind of tokenization error is corrected. 


3 Design 


In this section, we first present an overview of Discov- 
erer, then describe the three main phases of Discoverer in 
detail, and finally give a concrete example of a message 
format inferred by Discoverer. 


3.1 Overview 


The basic idea of Discoverer is to cluster messages 
with the same format together and infer the message 
format by comparing messages in a single cluster. We 
achieve this in three main phases (illustrated in Figure 1). 


• Tokenization and Initial Clustering: This phase 
operates on the raw packets, and helps in identifying 
field boundaries in a message and giving the first-order 
structure to the unlabeled messages. We first 
reassemble the packets into messages, and then break 
up a message into a sequence of tokens which is an 
approximation to a sequence of fields. Tokens 
belong to one of two token classes: binary or text. We 
then classify messages into various clusters based 
on each message’s token pattern, which is simply 
represented by the message direction and classes of 
its tokens. 

• Recursive Clustering: Since messages with the 
same token pattern do not necessarily have the same 
format, this phase further divides clusters of messages 
so that messages in each cluster have the same 
format, and infers the message format by comparing 
messages in each single cluster. To do so, we 
mimic the left-to-right recursive parsing of applications 
processing messages by recursively repeating 
the following steps. We first infer the message format 
that captures the content of all messages in a 
cluster. Then we identify the first FD field (which 
decides the format of the subsequent part of the 
message) in a left-to-right scan and use the values 
of this FD field to divide the cluster into subclusters. 

• Merging: This phase mitigates the over-classification 
problem, namely, messages of 
the same format may be scattered into multiple 
clusters. To do so, we merge similar message 
formats by using a type-based sequence alignment 
algorithm that compares the field structure of two 
inferred message formats. 

A key design rationale for Discoverer is to be conservative: 
it may scatter messages of the same format into 
more than one cluster, but it should not collate messages 
of different formats into the same cluster. This rationale 
is to ensure the correctness of inferred formats because, 
if there are messages of more than one format in a clus- 
ter, the inferred format might be too general by trying to 
capture multiple message formats at once. 


3.2 Tokenization and Initial Clustering 
3.2.1 Tokenization 


A token is a sequence of consecutive bytes likely to 
belong to the same application-level field. We require 
that the tokenization process works without any particu- 
lar distinction between text and binary protocols, since 
our tool is intended to be fully automatic and we wish 
to spare the user from the manual effort required to dis- 
tinguish between text and binary protocols. Further, it 
is hard to declare a protocol as purely text or purely bi- 
nary, since text protocols can contain binary bytes (e.g., 
an image file transferred over HTTP) and most binary 
protocols contain a few text fields (e.g., the name of a 
file). 

Our tokenization procedure generates two classes of 
tokens: text and binary. A text token is intended to span 
the several bytes of a single message field representing 
some text (such as “GET” in an HTTP request). Our pro- 
cedure for finding text tokens is as follows: we first iden- 
tify text bytes by comparing them with the ASCII val- 
ues of printable characters, and then consider a sequence 
of text bytes sandwiched between two binary bytes as 
a text segment. To avoid mistaking binary bytes for 
text bytes, we require this sequence to have a minimum 
length. Then we use a set of delimiters (e.g., space and 
tab) to divide a text segment into tokens. We also look 
for Unicode encodings in messages. For binary fields, 
identifying field boundaries is very hard; so we instead 
simply declare a single binary byte to be a binary to- 
ken in its own right. Note that this procedure can admit 
errors: consecutive binary bytes with ASCII values of 
printable characters are wrongly marked as a text token; 
a text string shorter than the minimum length is wrongly 
marked as binary tokens; a text field containing some 
white-space characters is wrongly divided into multiple 
text tokens. We correct these kinds of errors in the merging 
phase (see Section 3.4). 
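
The tokenization procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Discoverer's implementation: the minimum text length of 3, the exact set of text bytes (printable ASCII plus tab), and the names `tokenize`, `MIN_TEXT_LEN` and `DELIMITERS` are all hypothetical, message boundaries are treated like binary bytes when delimiting a text segment, and Unicode detection is omitted.

```python
from typing import List, Tuple

MIN_TEXT_LEN = 3     # assumed threshold; no concrete value is given above
DELIMITERS = b" \t"  # space and tab, the two delimiters named in the text


def is_text_byte(b: int) -> bool:
    # Printable ASCII plus tab (an assumption about what counts as text).
    return 0x20 <= b <= 0x7E or b == 0x09


def tokenize(msg: bytes) -> List[Tuple[str, bytes]]:
    """Split a message into ("text", value) and ("binary", value) tokens."""
    tokens: List[Tuple[str, bytes]] = []
    i = 0
    while i < len(msg):
        j = i
        while j < len(msg) and is_text_byte(msg[j]):
            j += 1
        if j - i >= MIN_TEXT_LEN:
            # Long enough run of text bytes: split it on the delimiters.
            word = bytearray()
            for b in msg[i:j]:
                if b in DELIMITERS:
                    if word:
                        tokens.append(("text", bytes(word)))
                        word = bytearray()
                else:
                    word.append(b)
            if word:
                tokens.append(("text", bytes(word)))
            i = j
        else:
            # Binary byte, or a text run too short to trust:
            # every byte becomes a binary token in its own right.
            end = max(j, i + 1)
            for b in msg[i:end]:
                tokens.append(("binary", bytes([b])))
            i = end
    return tokens


print(tokenize(b"\x05GET /index.html\x00\x01ab\xff"))
```

The short run `ab` in the sample ends up as two binary tokens, which is exactly the second kind of tokenization error listed above.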


3.2.2 Initial Clustering by Token Patterns 


Byte-wise sequence alignment based on the 
Needleman-Wunsch algorithm [15] has been used 
in previous studies [3, 11, 17] for aligning and comparing 
messages. We find that byte-wise sequence alignment, 
while ideally suited to align messages with similar byte 
patterns, is not suitable for aligning messages with the 
same format. For instance, fields with variable lengths 
may lead to mis-alignment of two messages of the 
same format. Further, parameter selection for sequence 
alignment is also hard as shown in [3]. 


To avoid aligning messages, we cluster messages 
based on their token patterns. The token 
pattern assigned to a message is a tuple, 

(dir, class_of_token_1, class_of_token_2, ...), 

where dir is the direction of the message (client to server 
or vice versa), followed by the classes of all tokens 
in the message. We consider the message direction 
because messages in opposite directions tend to have 
different formats. An example of a token pattern is 
(client_to_server, text, binary, text). 

Note that this initial clustering is coarse-grained since 
messages with different formats may have the same token 
pattern. For instance, SMTP commands typically 
have two text tokens (“MAIL sender”, “RCPT receiver”, 
“HELO server-name”). In the recursive clustering phase, 
we improve the granularity of this clustering by recur- 
sively identifying FD tokens and dividing clusters. 
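
Initial clustering by token pattern amounts to grouping messages by a tuple key. The sketch below assumes messages are already tokenized into (class, value) pairs. The type aliases and function names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Token = Tuple[str, bytes]              # (class, value): "text" or "binary"
Message = Tuple[str, Sequence[Token]]  # (direction, tokens)


def token_pattern(msg: Message) -> Tuple[str, ...]:
    """Return (dir, class_of_token_1, class_of_token_2, ...)."""
    direction, tokens = msg
    return (direction,) + tuple(cls for cls, _ in tokens)


def cluster_by_pattern(
    messages: List[Message],
) -> Dict[Tuple[str, ...], List[Message]]:
    """Group messages that share the same token pattern."""
    clusters: Dict[Tuple[str, ...], List[Message]] = defaultdict(list)
    for msg in messages:
        clusters[token_pattern(msg)].append(msg)
    return dict(clusters)


smtp = [
    ("client_to_server", [("text", b"MAIL"), ("text", b"alice")]),
    ("client_to_server", [("text", b"RCPT"), ("text", b"bob")]),
    ("client_to_server", [("text", b"HELO"), ("text", b"example.org")]),
]
for pattern, members in cluster_by_pattern(smtp).items():
    print(pattern, len(members))
```

All three SMTP-like commands land in one cluster even though they have different formats, which shows why this first clustering is coarse-grained and needs the recursive refinement.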


3.3 Recursive Clustering 


Our recursive clustering relies on identifying format 
distinguisher (FD) tokens. To find FD tokens, we need 
to invoke both format inference and format comparison. 
In this section, we first explain these procedures before 
describing how we recursively identify FD tokens and 
divide clusters. 


3.3.1 Format Inference 


The format inference phase takes as input a set of mes- 
sages and infers a format that succinctly captures the 
content of the set of messages. Our inferred message 
format is defined to be a sequence of token specifica- 
tions which include not only token semantics but also to- 
ken properties. We introduce token properties because 
we cannot infer the semantic meaning for every token 
and certain token properties are useful for describing the 
message format. Token properties currently cover two 
perspectives: binary vs. text and constant vs. variable. 
The first property reflects the token class, and the second 
decides if a token takes the same value across all 
messages of the same format (i.e., constant token) or 
different values in different application sessions (i.e., variable 
token). We also define the type of a token to be the 
combination of its semantics and properties. 

We now describe how token properties and semantics 
are derived. 

Property Inference: Token class is already identified 
during the tokenization phase. Constant or variable to- 
kens can also be easily identified. Since the set of mes- 
sages come from a single token-pattern cluster, tokens 
in one message can be directly compared against their 
counterparts in another message by simply using the to- 
ken offset. Thus, constant tokens are those that take the 
same value across the entire set of messages, and variable 
tokens are those that take more than one value. 
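
Property inference as described can be sketched in a few lines. The sketch assumes that every message in the cluster is a list of (class, value) tokens and that all messages have the same token pattern, so position `i` in one message corresponds to position `i` in another. The function name is hypothetical.

```python
from typing import List, Set, Tuple

Token = Tuple[str, bytes]  # (class, value)


def infer_properties(
    cluster: List[List[Token]],
) -> List[Tuple[str, str, Set[bytes]]]:
    """Return (class, "constant" | "variable", observed values) per token."""
    if not cluster:
        raise ValueError("empty cluster")
    width = len(cluster[0])
    if any(len(msg) != width for msg in cluster):
        raise ValueError("messages must share one token pattern")
    result = []
    for pos in range(width):
        values = {msg[pos][1] for msg in cluster}
        kind = "constant" if len(values) == 1 else "variable"
        result.append((cluster[0][pos][0], kind, values))
    return result


cluster = [
    [("binary", b"\x01"), ("text", b"alice")],
    [("binary", b"\x01"), ("text", b"bob")],
]
print(infer_properties(cluster))
```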

Semantic Inference: We currently support three se- 
mantics: length, offset, and cookie (see Section 2.1 for 
definitions). We will discuss how it may be possible to 
support other semantics in Section 6. We identify cookie 
fields at the end of the merging phase since it requires 
correlating multiple messages in the same session. We 
employ the heuristics in RolePlayer [3] for doing this. 
Our heuristics for identifying length and offset fields are 
an extension of those in RolePlayer. The intuition for 
identifying length fields is that, for a specific pair of 
messages, the difference in the values of potential length 
fields (at most four consecutive binary tokens or a text to- 
ken in the decimal or hex format) reflects the difference 
of the sizes of the messages or some subsequent tokens. 
We thus simply check for a match between the value dif- 
ference and the size difference. If a match holds for all 
pairs of messages in the cluster, the potential length field 
is declared to be a length field. For offset fields, we com- 
pare the value difference with the difference of the offsets 
of some subsequent tokens. 
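
The pairwise check for length fields can be sketched as below. Several details are assumptions made for illustration: candidate values are decoded big-endian, the compared size is passed in by the caller (total message size or the size of some subsequent tokens), and a candidate whose value never changes is rejected because the pairwise test would hold trivially. The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def candidate_value(binary_tokens: Sequence[bytes],
                    byteorder: str = "big") -> int:
    """Decode up to four consecutive one-byte binary tokens as an integer."""
    if not 1 <= len(binary_tokens) <= 4:
        raise ValueError("a candidate spans one to four binary tokens")
    return int.from_bytes(b"".join(binary_tokens), byteorder)


def is_length_field(values: List[int], sizes: List[int]) -> bool:
    """values[i]: candidate value in message i; sizes[i]: compared size.

    The candidate is accepted when, for every pair of messages, the
    difference in values equals the difference in sizes.
    """
    if len(values) != len(sizes) or len(values) < 2:
        return False
    if len(set(values)) == 1:
        return False  # assumption: a constant token carries no evidence
    pairs = list(zip(values, sizes))
    return all(v1 - v2 == s1 - s2
               for (v1, s1), (v2, s2) in combinations(pairs, 2))


# Toy format: a one-byte length followed by a text field of that length.
messages = [b"\x03abc", b"\x05hello", b"\x01x"]
values = [candidate_value([m[:1]]) for m in messages]
sizes = [len(m) for m in messages]
print(is_length_field(values, sizes))
```

For offset fields the same check applies with `sizes` replaced by the offsets of a subsequent token in each message.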


3.3.2 Format Comparison 


The goal of this procedure is to decide if two in- 
ferred message formats are the same. Given two formats, 
it scans these two formats token by token from left to 
right and matches the inferred type (i.e., semantics and 
properties) of a token from one format against its 
counterpart from the other. If all tokens match, the two formats 
are considered to be the same. 

Ideally, two tokens can be considered to match if their 
semantics match. However, since there are always tokens 
that we do not have semantics for, we need to compare 
their values (they have the same token class since the two 
formats have the same token pattern). We allow a con- 
stant token to match with a variable token if the latter 
takes the value of the former at least once. We also al- 
low a variable token to match with another if there is an 
overlap in the two sets of values taken by them. Note that 
these policies are conservative, which is in line with our 
design rationale. 
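
The matching rules can be sketched as follows. Each token specification carries its class, the set of values observed for it, and an optional semantic. Note that the constant-versus-variable rule and the variable-versus-variable rule both reduce to a non-empty intersection of the value sets. How to treat a token that has a semantic on one side only is not stated above, so the sketch treats it as a mismatch, which is an assumption. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class TokenSpec:
    cls: str                        # "text" or "binary"
    values: FrozenSet[bytes]        # one value: constant; several: variable
    semantic: Optional[str] = None  # e.g. "length", "offset", "cookie"


def tokens_match(a: TokenSpec, b: TokenSpec) -> bool:
    if a.cls != b.cls:
        return False
    if a.semantic is not None or b.semantic is not None:
        # Assumption: semantics must agree when either side has one.
        return a.semantic == b.semantic
    # No semantics known: fall back to comparing observed values.
    return bool(a.values & b.values)


def formats_equal(f1: Sequence[TokenSpec], f2: Sequence[TokenSpec]) -> bool:
    """Left-to-right scan; every token must match its counterpart."""
    return len(f1) == len(f2) and all(
        tokens_match(a, b) for a, b in zip(f1, f2))


f1 = [TokenSpec("binary", frozenset({b"\x01"})),
      TokenSpec("text", frozenset({b"a", b"b"}))]
f2 = [TokenSpec("binary", frozenset({b"\x01", b"\x02"})),
      TokenSpec("text", frozenset({b"b", b"c"}))]
print(formats_equal(f1, f2))
```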


3.3.3 Recursive Clustering by Format Distinguishers 


We identify FD tokens with the following algorithm. 
First, we invoke format inference on the set of messages 
in a cluster. Then, we scan the format token by token 
from left to right to identify FD tokens. We use three 
criteria in determining if a token is a FD: 


1. We first check if the number of unique values taken 
by this token across the set of messages is less than 
a threshold, referred to as the maximum distinct val- 
ues for a FD token. This is because a FD token typ- 
ically takes a few values corresponding to the num- 
ber of different formats. 


2. For tokens satisfying the first criterion, we perform 
a second test as follows. The cluster is divided into 
subclusters, one for each unique value taken by this 
token. Each subcluster consists of messages where 
the candidate FD token takes a specific value. We 
then require that the size of the largest subcluster ex- 
ceeds a threshold, referred to as the minimum clus- 
ter size. This is to guarantee that we can make a 
meaningful format inference in at least one subclus- 
ter. Otherwise, we gain nothing by continuing this 
splitting. 


3. If the potential FD token passes the second crite- 
rion, we invoke format comparison across subclus- 
ters to see if their formats are different from each 
other. We then merge those that manifest the same 
formats and leave others intact. 


This process is recursively performed on each of the 
subclusters because a message may have more than one 
FD token. We find the next FD token by scanning further 
down the message towards the right (end) of the message. 
It is necessary to scan all the way to the end because we 
need to recognize all FDs to obtain a good clustering and 
format inference. 

When looking for the next FD token, the format in- 
ference is invoked again on the set of messages in each 
subcluster. This is because the inferred token properties 
and semantics might change because the set of messages 
has become smaller, and it is possible for stronger prop- 
erties to hold. For instance, a previously variable token 
might now be a constant token; a previously variable to- 
ken might now be identified as a length field. 
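
The three criteria and the recursion can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a message is a tuple of token values, the hypothetical infer_format simply collects the value set at each position, formats are compared only on the tokens after the candidate FD using the value-overlap rule of Section 3.3.2, and both thresholds are treated as inclusive bounds.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Message = Tuple[str, ...]   # token values of one message, left to right
Format = List[Set[str]]     # simplified format: value set at each position


def infer_format(messages: Sequence[Message]) -> Format:
    """Stand-in for format inference: collect the values seen per position."""
    return [set(column) for column in zip(*messages)]


def same_format(f1: Format, f2: Format, start: int) -> bool:
    """Compare the tokens after the candidate FD by value overlap."""
    return all(a & b for a, b in zip(f1[start:], f2[start:]))


def recursive_cluster(messages: Sequence[Message], start: int = 0,
                      max_distinct: int = 10,
                      min_cluster: int = 20) -> List[List[Message]]:
    fmt = infer_format(messages)  # re-inferred on every (sub)cluster
    for pos in range(start, len(fmt)):
        # Criterion 1: only a few distinct values (a constant cannot split).
        if not 1 < len(fmt[pos]) <= max_distinct:
            continue
        # Criterion 2: the largest subcluster must be big enough.
        subclusters: Dict[str, List[Message]] = defaultdict(list)
        for message in messages:
            subclusters[message[pos]].append(message)
        if max(len(s) for s in subclusters.values()) < min_cluster:
            continue
        # Criterion 3: merge subclusters whose inferred formats are the same.
        groups: List[List[Message]] = []
        for sub in subclusters.values():
            for group in groups:
                if same_format(infer_format(group), infer_format(sub), pos + 1):
                    group.extend(sub)
                    break
            else:
                groups.append(list(sub))
        if len(groups) > 1:
            # The token is an FD: recurse to the right within each group.
            result: List[List[Message]] = []
            for group in groups:
                result.extend(
                    recursive_cluster(group, pos + 1, max_distinct, min_cluster))
            return result
    return [list(messages)]
```

With the defaults taken from Table 3 (10 distinct values, 20 messages), a token is treated as an FD only if splitting on it yields at least two groups whose subsequent tokens differ.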







3.4 Merging with Type-Based Sequence Alignment


In the tokenization and recursive clustering phases, we 
are conservative to ensure that the format inference pro- 
cedure operates correctly on a set of messages of the 
same format. However, this leads to a new problem of 
over-classification, namely, messages of the same format 
may be scattered into more than one cluster. This prob- 
lem can be quite severe; for instance, over a CIFS/SMB 
trace of almost four million messages, there are about 
7,000 clusters/formats as input to this phase, while the to- 
tal number of true formats is 130. The goal of the merg- 
ing process is to coalesce similar formats from different 
clusters into a single one. 

The key observation behind our merging phase is that, 
while sequence alignment [15] cannot be used for clus- 
tering messages of the same format, it can be used to 
align formats for identifying similar ones across differ- 
ent clusters. This is because we can leverage the rich to- 
ken types (i.e., semantics and properties) inferred in the 
recursive clustering phase. For instance, knowing that a 
particular token is a length field in a format necessitates 
that its counterpart in another format is also a length field 
for these two formats to be considered a match. We re- 
fer to our algorithm for aligning formats as type-based 
sequence alignment. 

In our type-based sequence alignment, we only allow 
two tokens of the same class (binary or text) to align with 
each other. We claim two aligned tokens are matched if 
they either have the same semantic or share at least one 
value (see Section 3.3.2 for details). 

To compensate for tokenization errors, we allow gaps 
in our type-based sequence alignment. In addition to us- 
ing gap penalties to control gaps, we introduce extra con- 
straints to avoid excessive gaps. First, consecutive binary 
tokens in one message format are allowed to align with 
gaps if they precede or follow a text token in the other 
message format in the alignment, and the number of bi- 
nary tokens is at most the size of the text token if the text 
token is aligned with a gap, or the size difference if it 
is aligned with another text token. This constraint is for 
handling the case of mistaking a sequence of binary to- 
kens to be a text token or vice versa. Second, a text token 
is allowed to align with a gap, but we allow at most two 
gaps of this kind. This constraint is for handling the case 
that a text field consisting of some white space characters 
is mistakenly divided into multiple tokens. 

When we align and compare two message formats to decide
whether to merge them, we first check if the gap constraints
can be satisfied. If not, we stop and declare the two formats
mismatched; otherwise, we continue to check the number of
mismatches. If at most one pair of aligned tokens is
mismatched, we consider the two formats matched and merge
them. Note that this is conservative because the mismatched
token can be treated as a variable token that takes values
from a new set covering both formats.

Since we use the gap constraints and the number of mismatches
to decide whether to merge two message formats, our merging
performance is insensitive to the sequence alignment
parameters (the scores for match, mismatch, and gap).
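
The merge decision described in this section can be sketched as a Needleman-Wunsch style alignment over tokens followed by the two checks. The sketch below is illustrative only and rests on several assumptions: the Tok record is hypothetical, the mismatch score is a placeholder (Table 3 as extracted here does not show its value), and the gap constraints are simplified. Binary tokens are never allowed to align with a gap, which is stricter than the first constraint above, while the limit of two gapped text tokens follows the second constraint.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Sequence, Tuple


class Tok(NamedTuple):
    """Hypothetical token record used only for this sketch."""
    cls: str                        # "b" (binary) or "t" (text)
    values: FrozenSet[str]
    semantic: Optional[str] = None


MATCH, GAP = 1, -2   # values from Table 3
MISMATCH = -1        # placeholder assumption, not taken from the paper

Pair = Tuple[Optional[Tok], Optional[Tok]]


def matched(a: Tok, b: Tok) -> bool:
    if a.semantic is not None or b.semantic is not None:
        return a.semantic == b.semantic
    return bool(a.values & b.values)


def align(f1: Sequence[Tok], f2: Sequence[Tok]) -> List[Pair]:
    """Global alignment in which only tokens of the same class may pair up."""
    n, m = len(f1), len(f2)

    def pair_score(i: int, j: int) -> float:
        a, b = f1[i - 1], f2[j - 1]
        if a.cls != b.cls:
            return float("-inf")
        return MATCH if matched(a, b) else MISMATCH

    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair_score(i, j),
                              score[i - 1][j] + GAP,
                              score[i][j - 1] + GAP)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            pairs.append((f1[i - 1], f2[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            pairs.append((f1[i - 1], None))
            i -= 1
        else:
            pairs.append((None, f2[j - 1]))
            j -= 1
    return pairs[::-1]


def should_merge(f1: Sequence[Tok], f2: Sequence[Tok]) -> bool:
    pairs = align(f1, f2)
    gapped = [a if b is None else b for a, b in pairs if a is None or b is None]
    # Simplified gap constraints: no binary gaps, at most two text gaps.
    if any(t.cls == "b" for t in gapped) or len(gapped) > 2:
        return False
    mismatches = sum(1 for a, b in pairs
                     if a is not None and b is not None and not matched(a, b))
    return mismatches <= 1
```

Because the final decision depends only on the gap check and the mismatch count, changing the placeholder scores affects the outcome only when it changes which alignment is optimal.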


3.5 An Example 


For better understanding, here we present a concrete 
example based on the SMB “Tree Connect AndX Re- 
quest” message format to explain the design and output 
of Discoverer. We obtain the true message format from 
Ethereal (see Figure 2 and Figure 3). The final inferred 
format by Discoverer is shown in Table 1. 

We can see that the inferred format is a sequence of 
tokens with token properties (binary vs. text, constant 
vs. variable) and semantics (e.g., length fields). For to- 
kens with unknown semantics, their possible values are 
also taken into account in the format. Before the merg- 
ing step, messages of this true format were scattered into 
24 clusters in 18 different token patterns. Different token 
patterns are due to the “smb.signature” field. Since this 
field may take any random values, we will have a differ- 
ent token pattern when more than three consecutive bytes 
at a different offset take values from the printable ASCII 
range and are wrongly treated as a text token. Messages 
in some token patterns were further split into fine-grained 
clusters in the recursive clustering phase due to our con- 
servative approach. Our merging technique mitigates this 
over-classification problem effectively. At the end, all of 
the 24 clusters were merged into a single one. 
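
To see why a random field such as “smb.signature” produces several token patterns, consider the following rough sketch. It is an assumption-laden simplification of the tokenization step (which is described in Section 3.2.1 and not reproduced here): a run of at least three printable ASCII bytes becomes one text token and every other byte becomes a binary token, ignoring Unicode, null terminators, and delimiters. The function name and the second and third byte strings are made up for illustration; the first is the signature value from Figure 2.

```python
def token_pattern(data: bytes, min_text_len: int = 3) -> str:
    """Return a pattern string with "t" per text token and "b" per binary byte."""
    pattern = []
    i = 0
    while i < len(data):
        j = i
        # Extend over a run of printable ASCII bytes starting at i.
        while j < len(data) and 0x20 <= data[j] <= 0x7E:
            j += 1
        if j - i >= min_text_len:
            pattern.append("t")   # long enough to look like text
            i = j
        else:
            pattern.append("b")   # treat this byte as a binary token
            i += 1
    return "".join(pattern)


signature_a = bytes.fromhex("05a09637b7419166")  # value shown in Figure 2
signature_b = bytes.fromhex("05a0414243b79166")  # hypothetical: "ABC" at offset 2
signature_c = bytes.fromhex("41424305a09637b7")  # hypothetical: "ABC" at offset 0
patterns = {token_pattern(s) for s in (signature_a, signature_b, signature_c)}
```

The three values yield three different patterns, so messages of one true format land in different clusters until the merging phase brings them back together.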

This example also shows the possibility of imprecise field
boundaries. For example, the first null byte of the field
“smb.nt_status” was treated as the null terminator for the
text token before it. However, we believe this kind of
imprecision will not affect the effectiveness of the inferred
format but instead create some extra inferred formats with
different values for “smb.nt_status”.


4 Evaluation 


We implemented Discoverer in 5,700 lines of C++ code on
Windows. The tool takes a network capture file either in the
libpcap [12] or Netmon [14] format as input and outputs
inferred message formats: a sequence of tokens with the
inferred properties and semantics. Our current un-optimized
implementation takes about 6-12 hours for a trace of several
million messages (the merging procedure is the slowest due to
the need of pairwise comparisons of all inferred formats).
Before discussing the experimental results, we first describe
our data sets and evaluation methodology.

<proto name="nbss" showname="NetBIOS Session Service" size="4" pos="54">
  <field name="nbss.type" showname="Message Type: Session message" size="1" pos="54" show="0" value="00"/>
  <field show="Length: 156" size="3" pos="55" value="00009c"/>
</proto>
<proto name="smb" showname="SMB (Server Message Block Protocol)" size="156" pos="58">
  <field show="SMB Header" size="32" pos="58">
    <field show="Server Component: SMB" size="4" pos="58" value="ff534d42"/>
    <field name="smb.cmd" showname="SMB Command: Tree Connect AndX (0x75)" size="1" pos="62" show="0x75" value="75"/>
    <field name="smb.nt_status" showname="NT Status: STATUS_SUCCESS (0x00000000)" size="4" pos="63" show="0x00000000" value="00000000"/>
    <field show="Flags: 0x18" size="1" pos="67" value="18">
    <field show="Flags2: 0xc807" size="2" pos="68" value="07c8">
    <field name="smb.pid.high" showname="Process ID High: 0" size="2" pos="70" show="0" value="0000"/>
    <field name="smb.signature" showname="Signature: 05A09637B7419166" size="8" pos="72" show="05:a0:96:37:b7:41:91:66" value="05a09637b7419166"/>
    <field name="smb.reserved" showname="Reserved: 0000" size="2" pos="80" show="00:00" value="0000"/>
    <field name="smb.tid" showname="Tree ID: 0" size="2" pos="82" show="0" value="0000"/>
    <field name="smb.pid" showname="Process ID: 65279" size="2" pos="84" show="65279" value="fffe"/>
    <field name="smb.uid" showname="User ID: 2048" size="2" pos="86" show="2048" value="0008"/>
    <field name="smb.mid" showname="Multiplex ID: 128" size="2" pos="88" show="128" value="8000"/>
  </field>
  <field show="Tree Connect AndX Request (0x75)" size="124" pos="90">
    <field name="smb.wct" showname="Word Count (WCT): 4" size="1" pos="90" show="4" value="04"/>
    <field show="AndXCommand: No further commands (0xff)" size="1" pos="91" value="ff"/>
    <field name="smb.reserved" showname="Reserved: 00" size="1" pos="92" show="00" value="00"/>
    <field name="smb.andxoffset" showname="AndXOffset: 156" size="2" pos="93" show="156" value="9c00"/>
    <field name="smb.connect.flags" size="2" pos="95" value="0c00">
    <field name="smb.pwlen" showname="Password Length: 1" size="2" pos="97" show="1" value="0100"/>
    <field name="smb.bcc" showname="Byte Count (BCC): 113" size="2" pos="99" show="113" value="7100"/>
    <field name="smb.password" showname="Password: 00" size="1" pos="101" show="00" value="00"/>
    <field name="smb.path" showname="Path: \\SP-SIN-DCF-01.SOUTHPACIFIC.CORP.MICROSOFT.COM\IPC$" size="106" pos="102" show="\\\\SP-SIN-DCF-01.SOUTHPACIFIC.CORP.MICROSOFT.COM\\IPC$" value="5c005c00..."/>
    <field name="smb.service" showname="Service:
  </field>
</proto>

Figure 2: Ethereal’s XML output of an example SMB “Tree Connect AndX Request” message (edited for better presentation).


4.1 Data Sets 


We tested Discoverer on traces from two different 
sites: a honeyfarm site [2] (which responds to un- 
solicited, mostly malicious traffic) and a busy enterprise 
(which has diverse and high-volume traffic). The hon- 
eyfarm trace consists of CIFS/SMB only. The enterprise 
trace includes HTTP, CIFS/SMB, and RPC. The honey- 
farm trace and the HTTP trace were used as the calibra- 
tion data to help guide the design process and set tunable 
parameters. Our results are presented based on the output 
of our tool on the traces from the enterprise site, which 
served as the evaluation data. Thus, CIFS/SMB can be seen as
the evaluation case where the tool was trained on the trace
from a different site, whereas RPC is the case where the tool
is evaluated over a new protocol. Though CIFS/SMB messages
may encapsulate the RPC layer, the RPC trace consists of RPC
traffic exclusively. The HTTP trace was used for both
calibration and evaluation, but we hardly tailored our tool
to HTTP.

In our experiment, the CIFS/SMB and RPC traces from
the enterprise site contain traffic in one direction only.
This will not affect our evaluation because the proto- 
col formats in both directions are equally complicated 
based on Ethereal’s parsing results of the honeyfarm 
CIFS/SMB trace. This is not to say that if we can infer 
the format in one direction, we are guaranteed to infer the 
format in the other direction; but the performance in one 
direction does give an indication of the performance in 
the other direction. In addition, since we do not put mes- 
sages in different directions into the same cluster, uni- 
directional traffic does not make the problem any easier. 

For the HTTP trace, our tool reassembled consecu- 
tive data sent in one direction into a message. For the 
CIFS/SMB and RPC traces, we leveraged Ethereal to 
parse them and identify message boundaries. A summary 
of these traces is shown in Table 2. 
nbss.type;Length;Server Component;smb.cmd;smb.nt_status;smb.flags;smb.flags2;smb.pid.high;
smb.signature;smb.reserved;smb.tid;smb.pid;smb.uid;smb.mid;smb.wct;AndXCommand;smb.reserved;
smb.andxoffset;smb.connect.flags;smb.pwlen;smb.bcc;smb.password;smb.path;smb.service

Figure 3: The “name” of the true format for the example message in Figure 2 concatenates the human readable names of all the fields.

Token       | True Field       | Token    | True Field    | Token    | True Field     | Token       | True Field
C(b,00)     | nbss.type        | C(b,00)  | smb.pid.high  | C(b,00)  | smb.tid        | C(b,00)     |
C(b,00)     | Length           | C(b,00)  |               | C(b,00)  |                | V(b,2)      | smb.connect.flags
C(b,00)     |                  | V(b,256) | smb.signature | C(b,ff)  | smb.pid        | C(b,00)     |
L(b)        |                  | V(b,256) |               | C(b,fe)  |                | C(b,01)     | smb.pwlen
C(b,ff)     | Server Component | V(b,256) |               | V(b,33)  | smb.uid        | C(b,00)     |
C(tn,SMBu)  | smb.cmd          | V(b,256) |               | V(b,32)  |                | L(b)        | smb.bcc
C(b,00)     | smb.nt_status    | V(b,256) |               | V(b,13)  | smb.mid        | C(b,00)     |
C(b,00)     |                  | V(b,256) |               | V(b,211) |                | C(b,00)     | smb.password
C(b,00)     |                  | V(b,256) |               | C(b,04)  | smb.wct        | V(tun,664)  | smb.path
C(b,18)     | smb.flags        | V(b,256) |               | C(b,ff)  | AndXCommand    | C(tn,?????) | smb.service
C(b,07)     | smb.flags2       | C(b,00)  | smb.reserved  | C(b,00)  | smb.reserved   |             |
C(b,c8)     |                  | C(b,00)  |               | L(b)     | smb.andxoffset |             |

Table 1: Discoverer’s inferred format for the true format in Figure 3. For C(x,y), C means constant, x means binary
(“b”) or text (“t”; in text tokens, “u” means Unicode and “n” means it is null terminated), y is the hex value or string
of the token; for V(x,z), z is the number of different values for the token.





























Protocol | Source     | Size (B) | # Messages | # True Formats
HTTP     | Enterprise | 4.6G     | 5,950,453  | 2,696
RPC      | Enterprise | 179M     | 351,818    | 50
CIFS/SMB | Enterprise | 1.0G     | 3,818,267  | 301
CIFS/SMB | Honeyfarm  | 1.1G     | 1,439,744  | 1220

Table 2: Summary of network traces used in the evaluation.


4.2 Evaluation Methodology 


Our evaluation methodology is to compare the quality 
of our output with the set of true message formats. To ob- 
tain the true format, instead of trying to manually extract 
it from documentation and RFCs, we used the protocol 
analyzers in Ethereal [5]. Ethereal can parse a network 
trace and produce, for each message in the trace, an XML 
output that includes the list of fields in the message, the 
values of those fields, some human readable names and 
their sizes. Based on this output, we assign every mes- 
sage a true format “name”, which is simply the concate- 
nation of the human readable names of all the fields. An 
example of Ethereal’s XML output and the true format 
name is shown in Figure 2 and Figure 3. 
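
The construction of a true format name can be sketched as follows. This is a hedged illustration rather than the tool used in the evaluation: the labeling rule (use the name attribute when present, otherwise the text before the colon in show, and skip container fields that have children) is an assumption inferred from Figures 2 and 3, and the sample XML is a shortened, well-formed fragment in the style of Figure 2 wrapped in a made-up packet element.

```python
import xml.etree.ElementTree as ET


def field_label(field: ET.Element) -> str:
    """Human readable label of one field (assumed rule, see lead-in)."""
    name = field.get("name")
    if name:
        return name
    return field.get("show", "").split(":", 1)[0].strip()


def true_format_name(xml_text: str) -> str:
    """Concatenate the labels of all leaf fields in document order."""
    root = ET.fromstring(xml_text)
    leaves = [f for f in root.iter("field") if len(f) == 0]
    return ";".join(field_label(f) for f in leaves)


SAMPLE = """<packet>
  <proto name="nbss">
    <field name="nbss.type" show="0" value="00"/>
    <field show="Length: 156" value="00009c"/>
  </proto>
  <proto name="smb">
    <field show="SMB Header">
      <field show="Server Component: SMB" value="ff534d42"/>
      <field name="smb.cmd" show="0x75" value="75"/>
    </field>
  </proto>
</packet>"""

format_name = true_format_name(SAMPLE)
```

Two messages then belong to the same true format exactly when this string is identical for both.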

We characterize the performance of our tool and high- 
light the results in the following metrics: 


• Correctness: If a cluster contains messages from more than
one true format, then Discoverer will make an incorrect
inference. Thus we measure correctness by checking the number
of different true formats followed by the messages in a
cluster. For all three protocols, over 90% of clusters
contain messages from a single true format.

• Conciseness: Our conservative clustering may cause multiple
inferred formats to cover subsets of a single true format. A
large number of redundant formats will affect the conciseness
of the protocol specifications generated by our tool. Thus we
measure conciseness by the ratio of the number of inferred
formats to the number of true formats followed by their
messages. For all three protocols, we achieved a low 5 to 1
ratio.

• Coverage: We measure the trace coverage from two
perspectives: the fraction of messages covered by our
inferred formats and the fraction of true formats followed by
covered messages. For all three protocols, our message
coverage is above 95% while our format coverage is around
30-40%.

Parameter                       | Value
Maximum message prefix          | 2048 bytes
Minimum length of text segments | 3 letters
Minimum cluster size            | 20 messages
Maximum distinct values for FD  | 10
Alignment match score           | 1
Alignment mismatch score        |
Alignment gap score             | -2

Table 3: Summary of parameters.
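
The three metrics defined in this section can be computed directly from a labeled clustering. The sketch below assumes a hypothetical representation in which each cluster is the list of true format names of its messages; the function name and the toy data are made up for illustration and are not taken from the evaluation.

```python
from typing import Dict, List


def evaluate(clusters: List[List[str]], total_messages: int,
             total_true_formats: int) -> Dict[str, float]:
    """Correctness, conciseness, and coverage for a labeled clustering."""
    covered_formats = {name for cluster in clusters for name in cluster}
    covered_messages = sum(len(cluster) for cluster in clusters)
    single_format = sum(1 for cluster in clusters if len(set(cluster)) == 1)
    return {
        # Fraction of clusters whose messages follow a single true format.
        "correctness": single_format / len(clusters),
        # Inferred formats per true format followed by covered messages.
        "conciseness": len(clusters) / len(covered_formats),
        "message_coverage": covered_messages / total_messages,
        "format_coverage": len(covered_formats) / total_true_formats,
    }


# Toy example: 4 clusters, 12 messages in the trace, 5 true formats overall.
toy = [["A"] * 3, ["A"] * 2, ["B"] * 4, ["B", "C"]]
metrics = evaluate(toy, total_messages=12, total_true_formats=5)
```

Plugging in the HTTP figures reported in Section 4.4 (3,926 inferred formats over 865 covered true formats) gives a conciseness of about 4.5, which is reported as a 5 to 1 ratio.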


As for the semantic inference, all the length fields in- 
ferred by Discoverer are correct; certain length fields are 
missed due to the trace limitation. For instance, some 
true formats in CIFS/SMB have a fixed message size. In 
this case, Discoverer will treat the length fields that re- 
flect the message size as constant tokens, and it will not 
affect parsing messages of these formats in practice. 


4.3 Tunable Parameters


Discoverer has just a few tunable parameters (see Ta- 
ble 3). For a message larger than 2048 bytes, we only 
consider the first 2048 bytes, referred to as the maximum 
message prefix. The minimum length of text segments 
controls the tokenization procedure (Section 3.2.1). The 
minimum cluster size and the maximum distinct values 
for FD are used in the recursive clustering phase (see 
Section 3.3.3). The match/mismatch/gap scores are pa- 
rameters for sequence alignment [15]. We observed that 
the performance of Discoverer is not sensitive to the set- 
tings of these parameters. For instance, we saw similar 
performance when we changed the maximum prefix size 
from 2048 bytes to 1024 bytes or changed the minimum 
cluster size from 20 messages to 10 messages. In ad- 
dition, our type-based sequence alignment is not sensi- 
tive to the match/mismatch/gap scores as we discussed 
in Section 3.4. Thus we take the same parameters for our 
evaluations on all three protocols. 


In the rest of this section, we present the experi- 
mental results on the enterprise traces for HTTP, RPC, 
and CIFS/SMB. Note that we use the inferred format 
and cluster interchangeably because we infer one format 
from each cluster. 
Figure 4: Heavy-tail distribution of message format popularity in HTTP.

Figure 5: Correctness for HTTP: CDF of the number of true formats followed by messages of a cluster for all clusters before the merging phase.


4.4 HTTP


The HTTP protocol allows an arbitrary number of “parameter:
value” pairs in an arbitrary order. We refer to this as the
“set” semantic. Currently we are unable to identify this set
semantic, so we treat each ordering of the set elements as a
distinct format. By doing so, we observed 2,696 formats from
the parsing results of Ethereal. We leave the identification
of the set semantic as future work (see Section 6).

In Figure 4, we show the number of messages of each true
format in the HTTP trace. Note that the y-axis is in
logarithmic scale. This clearly reveals the heavy-tail
distribution; most messages (more than 99%) fall in the top
1,000 true formats. We observed a similar trend in the RPC
and CIFS/SMB traces as well. The implication for our tool is
that the format coverage and message coverage are likely to
be very different; the latter will be much higher than the
former. In HTTP, we inferred 3,926 formats, which covered
5,938,511 out of 5,950,453 messages (99.8%). The covered
messages belong to 865 out of 2,696 true formats (32%). Since
we have a hard requirement on the minimum size of a cluster,
we conjecture that the coverage ratio in terms of true
formats will improve when the trace grows and each format has
more messages.


Figure 5 plots the CDF of the number of true formats 
followed by messages of a cluster for all clusters before 
the merging phase. This reflects the correctness of our 
tool. This figure shows that about 90% of our inferred 
clusters are correct. They correspond to only one true 
format. The number rises to over 95% if we include in- 
ferred formats that match two true formats. 


By manually inspecting the results, we found that the 
clustering errors are mainly due to the inaccuracy in 
Ethereal parsing. For example, in some message for- 
mats, Discoverer infers that there is a token that can be 
either “Connection” or “Proxy-Connection”. Discoverer 
does not treat it as a FD because both may be followed 
by the same set of values such as “Close” or “Keep- 
Alive”. However, Ethereal does not recognize “Proxy- 
Connection” as a parameter for HTTP, and returns a null 
string for this field in its parsing result, while it returns 
“http.connection” for “Connection”. So we will have two 
true formats for a cluster that contains both “Connection” 
and “Proxy-Connection”. Thus, our conciseness number 
may improve if Ethereal has more accurate parsing. 


The merging phase reduced 4,465 clusters to 3,926
clusters. Since the covered messages belong to 865 true
formats, this gives us a 5 to 1 ratio. In fact, almost 80%
of true formats are scattered into at most five clusters. On
one hand, our conservative strategy eliminated false pos- 
itives (i.e., wrongly merging two clusters that correspond 
to two different true formats). On the other hand, it did 
not help much in merging clusters for HTTP. The rea- 
son is as follows. HTTP allows many parameters in the 
form of “parameter: value”. We treat the “parameter:” 
and “value” as separate tokens because of the space in 
between. Since the “value” token for certain parameters 
such as “PROXY” may be arbitrary strings, it is likely 
for such “value” tokens in two clusters to not have over- 
lapped values. In this case, we will treat them as a mis- 
match. If two clusters happen to have more than one such 
mismatch, we will not merge them based on our conser- 
vative policy. 


Figure 6: Effectiveness of merging for RPC: Number of inferred formats into which messages of a single true format are “scattered”: before and after merging.


4.5 RPC 


The RPC trace consists exclusively of RPC traffic. Though the
trace size is 179MB, an order of magnitude smaller than the
HTTP and CIFS/SMB traces, we observed a similar trend in the
distribution of the number of messages in each true format.
Overall, we inferred 33 formats, which covered 340,624 out of
351,818 messages (96.8%). The covered messages belong to 18
out of 50 true formats (36%).

The recursive clustering generated 83 clusters, among which
78 clusters contain messages from a single true format, and
the remaining five clusters have messages from two true
formats. The merging phase helped reduce the overall number
of clusters from 83 to 33 without introducing false
positives. This shows that our merging phase compensates well
for tokenization errors by recognizing wrongly classified
binary and text tokens. From Figure 6 we can see that, for
each of 11 true formats, its messages were merged into a
single cluster.


4.6 CIFS/SMB 


CIFS/SMB is a fairly complex binary protocol which 
includes several layers of protocols: it consists of the 
NetBIOS Session Service (NBSS) headers which en- 
capsulate a SMB header which in turn is layered over 
RPC. Overall, we inferred 679 formats, which covered 
3,640,239 out of 3,818,267 messages (95.3%). The cov- 
ered messages belong to 130 out of 301 true formats 
(43%). 

In Figure 7, we plot the CDF of the number of true formats
followed by messages of a cluster for all clusters before the
merging phase. We can see that 57% of clusters contain
messages from a single true format, and 35% of clusters have
messages from two true formats. We manually checked these
clusters and found that this is due to the imprecise parsing
of Ethereal. It recognized the last field as a
“dcerpc.nt.close.frame” field for some messages and as stub
data for other messages, while those messages have the same
format according to our manual inspection. If we take this
factor into account, more than 90% of clusters contain
messages from a single true format, which is consistent with
HTTP.

Figure 7: Correctness for CIFS/SMB: CDF of the number of true formats followed by messages of a cluster for all clusters before the merging phase.

Figure 8: Merging effectiveness for CIFS/SMB: Number of inferred formats into which messages of a single true format are “scattered”: before and after merging.

We further inspected the clusters consisting of messages from
more than two formats and found that, for many such clusters,
the only difference in the true formats followed by their
messages is the last field. It is in the form of “Stub data
(XX bytes)”, and the difference is in “XX”, which gives the
size of the stub data. Based on our manual inspection, we
conjecture that these stub data likely follow the same format
and the size difference is due to a text field with a
variable length embedded in the stub data. Therefore, 90% is
a conservative measure of the correctness.

In Figure 8, we plot the number of inferred formats into
which messages of a single true format are scattered before
and after merging. As with RPC, the merging technique is
effective on CIFS/SMB. Overall, we reduced the number of
clusters from 7,180 to 679 without introducing false
positives, which gives us a 5 to 1 ratio against the 130 true
formats.


5 Related Work 


We divide related work into three categories. First, 
we discuss the state of the art in protocol reverse en- 
gineering. Second, we present the previous work that 
was geared towards a specific application rather than per- 
forming all-purpose protocol reverse engineering. Fi- 
nally, we discuss grammar inference. 

To date, most protocol reverse engineering appears to 
be a painstaking manual process, which involves looking 
at available documentation, source code, and traces. Two 
popular examples in the community include the SAMBA 
project [18] and the messenger clients [6]. Automatic 
protocol reverse engineering appears to have received 
much less attention. The most closely related work to 
our paper that we are aware of is the Protocol Informatics 
project [17]. It aims to employ sequence alignment tech- 
niques to infer protocol formats from a trace of the pro- 
tocol. Its main limitation is that the byte-wise sequence 
alignment, while ideally suited to aligning messages with 
similar byte sequences, is not suitable for aligning mes- 
sages with similar formats. In addition, selecting weights 
to tune the alignment is hard as shown in [3]. 

Previous studies have also performed some level of 
protocol reverse engineering with a specific purpose 
in mind, namely, application-level replay and protocol 
identification. 

RolePlayer [3] and ScriptGen [10, 11] both leverage byte-wise
sequence alignment techniques to achieve application-level
replay by heuristically detecting and adjusting some specific
fields such as network addresses, lengths, and cookies. A
driving application for application-level replay is to build
a protocol-independent, application-level proxy to filter
known attacks in a honeyfarm. To improve the performance of
the Needleman-Wunsch algorithm [15], RolePlayer uses the
so-called pairwise constraint matrix, which specifies whether
the ith byte of the first message can or cannot be aligned
with the jth byte of the second message based on the field
semantic of the ith byte. However, if the semantic of the ith
byte in the first message is unknown, it can be aligned with
any byte in the second message, which may lead to alignment
errors. There are two key differences between Discoverer and
these two systems. First, RolePlayer and ScriptGen only
discover the protocol format to the extent necessary for
replay, while Discoverer aims to discover the complete
protocol format. Second, instead of using byte-wise sequence
alignment, we first cluster messages based on token patterns
and then use a novel type-based sequence alignment technique
to align and compare message formats based on token types.
This represents a significant improvement: on one hand, we
can avoid byte-pattern alignment in the recursive clustering
phase to achieve good correctness; on the other hand, we can
mitigate over-classification by merging similar inferred
formats. Furthermore, compared with ScriptGen, which clusters
messages by comparing whole messages at once, our recursive
clustering technique performs better because we not only look
at the potential FD token itself but also look into “the
future” by comparing the subsequent tokens in the messages.
Some of our techniques for identifying semantically important
fields (such as length fields) are borrowed from RolePlayer.


Ma et al. [13] perform protocol identification, that is,
they classify the set of sessions in a trace into various 
protocols without relying on port numbers. They develop 
three techniques for profiling messages exchanged in a 
protocol: product distributions of byte offsets, Markov 
models of byte transitions, and common substring graphs 
of message strings. The main difference between their 
work and ours is that we have different goals. They aim 
to characterize a protocol based on the first n (e.g., 64) 
bytes in sessions of the protocol; we leverage the format 
inference and type-based sequence alignment techniques 
to decipher the message formats of the entire session. 


The problem of protocol reverse engineering is related to the
theoretical problem of grammar inference, which aims to
deduce the grammar given a set of sample strings drawn from
it. This problem is unfortunately theoretically unsolvable,
even when the grammar is in the simplest form of Chomskian
grammar, the regular language [7]. Since even a regular
language can be potentially infinite and the sample set
cannot be, it turns out that this task is impossible. The
language implicit in application-level protocols is often
substantially more complex than a regular language, involving
fields such as length fields. Because of this complexity, we
were unable to directly apply any results from the grammar
inference community. There have been extensions based on
Kolmogorov complexity [4] to learn the simplest finite
language from only positive examples, but once again, they
appear too complicated to apply to the context-sensitive
grammars that network protocols involve.

Techniques used in the speech recognition community,
such as probabilistic Markov chain analysis [8], were
not applied in our work, since the correlation between 
protocol fields makes it difficult for the byte sequence in 
a message to be modeled as independent samples from a 
Markov process. 


6 Limitations and Future Work 


In this section, we discuss the limitations of our ap- 
proach. We categorize our limitations into two cate- 
gories: ones that are fundamental to the problem we want 
to solve and those that are due to the heuristics in our 
tool. We will also describe future research directions for 
solving these limitations. 

There are two main fundamental limitations. 


• Trace Dependency: The format generated by any
tool that operates only on the trace is limited by the 
diversity of traffic seen in the trace. If certain mes- 
sage formats never occur in the trace, or if certain 
variable fields never take more than one value in the 
trace, it is impossible for such a tool to infer those 
message formats or identify those fields as variable 
fields. 


• Pre-Defined Semantics: Only a set of pre-defined
semantics can be inferred. Since it is not possible 
to find all the possible semantics of all fields just 
from a trace, the best one can hope for is to have an 
extensible framework where new semantic modules 
can be added as desired. 


We now move on to the imprecision problems that are 
directly related to the design of our tool. The following 
are the major imprecisions in our inferred message for- 
mats: 


• Semantics: At present, we cannot capture the following
semantics. (a) Set semantics: For instance,
HTTP allows an arbitrary number of parameters to 
be specified in any order. Identifying this list of sup- 
ported parameters as a set that allows re-ordering 
during encoding would considerably improve our 
performance. (b) Pointer field: This is a field whose 
value is the offset of another field in an array of 
some arbitrary items. Such fields occur in DNS. 
(c) Array length: This is a field whose value is the 
number of items in an array of some arbitrary items 
(e.g., DNS). We plan to study the inference of these 
semantics in the future. 


• Coalescing Fields: We identify a binary field as a
sequence of binary tokens, each spanning a single
byte. This is a limitation, since ideally we would
want such a field identified as a single binary token.
Unlike text fields, binary fields may offer no clue
for delimiting them. The only option is to use techniques
based on frequency analysis (e.g., does this
byte vary as much as the other one?). However,
such techniques tend to be unreliable. For instance,
in a two-byte process ID field, the more significant
byte may change much less frequently than
the less significant one, since an operating system
usually issues process IDs incrementally from zero.
Thus we chose to list such fields as a series of single-byte
tokens. Our plan is to enrich our semantic inference
modules so that a sequence of binary bytes
with a common semantic can be identified as a single
field. For example, a length field spanning four
bytes will be identified as a single field because of
the semantic module for detecting length fields.


• Asynchronous Protocols: With asynchronous protocols,
it is difficult to even delimit messages from
network packets. This is because messages in one 
direction may be interrupted by those in the other 
direction, and messages in one direction may be de- 
layed allowing two back-to-back messages in the 
other direction. We have not experimented with any 
asynchronous protocols so far. 


• Application Sessions: Currently, our tool analyzes
each connection in isolation. However, if we had 
a good session description of the various connec- 
tions and various hosts involved, it would be trivial 
to process the trace with such session knowledge, 
and derive formats for the whole session. A previ- 
ous study [9] aimed to semi-automatically discover 
session structures. 


• State Machine Inference: Currently, we only envision
a state machine constructed from the trace
by using the inferred message formats to assign a
type to each message, and then simply inferring the
FSA that captures the sequences of messages in all
sessions in the trace. However, this is hardly the
compact FSA that the application developer had in
mind. In such a case, using FSA minimization techniques
[19] may simplify the FSA considerably.
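
As a rough illustration of the state machine construction described in the last item, the sketch below builds a prefix-tree automaton from sessions whose messages have already been assigned types. The message type names and the `build_prefix_fsa` helper are hypothetical and are not part of Discoverer; the result is the unminimized FSA that a minimization step would then compact.

```python
from typing import Dict, List, Tuple


def build_prefix_fsa(sessions: List[List[str]]) -> Dict[Tuple[int, str], int]:
    """Build a prefix-tree automaton over message types.

    State 0 is the start state; every distinct prefix of message
    types observed in some session gets its own state.
    """
    transitions: Dict[Tuple[int, str], int] = {}
    next_state = 1
    for session in sessions:
        state = 0
        for msg_type in session:
            key = (state, msg_type)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
    return transitions


# Hypothetical typed sessions (type names are illustrative only).
sessions = [
    ["NEGOTIATE", "SETUP", "READ"],
    ["NEGOTIATE", "SETUP", "WRITE"],
    ["NEGOTIATE", "SETUP", "READ", "READ"],
]
fsa = build_prefix_fsa(sessions)
num_states = 1 + len(fsa)  # start state plus one state per transition target
```

A prefix tree never merges states, so repeated messages such as the two consecutive READs occupy separate states; this is the kind of redundancy that minimization is meant to remove.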


Many of the limitations above are due to the limited
information available from network traces. To tackle these
limitations and achieve a better reverse-engineered protocol
specification, we can use program analysis to gain
more information and insight into the parsing and processing
of the input in the program. For instance, we
may easily identify two consecutive bytes as a WORD
(i.e., a two-byte integer) from run-time analysis by observing
that they are processed as a WORD throughout
the execution.

We have focused on reverse engineering network pro- 
tocols in Discoverer; it would be useful to reverse engi- 
neer the input specifications for file-based applications, 
since we have seen a significant growth in file-based at- 
tacks. 


7 Conclusion 


Protocol reverse engineering is still a largely manual
process today, one that practitioners endure because of the
immense value of protocol knowledge. We have developed
Discoverer, a tool that aims to automate this reverse
engineering process. Discoverer leverages recursive
clustering and type-based sequence alignment to infer
message formats. We have demonstrated that Discoverer
can infer message formats effectively for three network
protocols: CIFS/SMB, RPC, and HTTP. In the future, we
plan to enrich our semantic inference, study
protocol state machine inference, explore the use of
program analysis to reverse engineer the specifications
of both network and file input, and apply
reverse-engineered protocol specifications to real-world
applications.


Abstract


Different implementations of the same protocol specifi- 
cation usually contain deviations, i.e., differences in how 
they check and process some of their inputs. Deviations 
are commonly introduced as implementation errors or as 
different interpretations of the same specification. Auto- 
matic discovery of these deviations is important for sev- 
eral applications. In this paper, we focus on automatic 
discovery of deviations for two particular applications: 
error detection and fingerprint generation. 

We propose a novel approach for automatically de- 
tecting deviations in the way different implementations 
of the same specification check and process their input. 
Our approach has several advantages: (1) by automati- 
cally building symbolic formulas from the implementa- 
tion, our approach is precisely faithful to the implemen- 
tation; (2) by solving formulas created from two different 
implementations of the same specification, our approach 
significantly reduces the number of inputs needed to find 
deviations; (3) our approach works on binaries directly, 
without access to the source code. 

We have built a prototype implementation of our ap- 
proach and have evaluated it using multiple implemen- 
tations of two different protocols: HTTP and NTP. Our 
results show that our approach successfully finds devi- 
ations between different implementations, including er- 
rors in input checking, and differences in the interpre- 
tation of the specification, which can be used as finger- 
prints. 


1 Introduction 


Many different implementations usually exist for the
same protocol. Due to the abundance of coding errors
and protocol specification ambiguities, these implementations
usually contain deviations, i.e., differences in how
they check and process some of their inputs. As a result,
the same inputs can cause different implementations to reach
semantically different protocol states. For example, an
implementation may not perform sufficient input checking
to verify that an input is well-formed as specified in the
protocol specification. Thus, for some inputs, it might
exhibit a deviation from another implementation that
follows the protocol specification and performs the correct
input checking.

Finding these deviations in implementations is impor- 
tant for several applications. In particular, in this paper 
we show 1) how we can automatically discover these de- 
viations, and 2) how we can apply the discovered devia- 
tions to two particular applications: error detection and 
fingerprint generation. 

First, finding a deviation between two different imple- 
mentations of the same specification may indicate that at 
least one of the two implementations has an error, which 
we call error detection. Finding such errors is important 
to guarantee that the protocol is correctly implemented, 
to ensure proper interoperability with other implementa- 
tions, and to enhance system security since errors often 
represent vulnerabilities that can be exploited. Enabling 
error detection by automatically finding deviations be- 
tween two different implementations is particularly at- 
tractive because it does not require a manually written 
model of the protocol specification. These models are 
usually complex, tedious, and error-prone to generate. 
Note that such deviations do not necessarily flag an er- 
ror in one of the two implementations, since deviations 
can also be caused by ambiguity in the specification or 
when some parts are not fully specified. However, au- 
tomatic discovery of such deviations is a good way to 
provide candidate implementation errors. 

Second, such deviations naturally give rise to finger- 
prints, which are inputs that, when given to two differ- 
ent implementations, will result in semantically differ- 
ent output states. Fingerprints can be used to distinguish 
between the different implementations and we call the 
discovery of such inputs fingerprint generation. Fingerprinting
has been in use for more than a decade [25]
and is an important tool in network security for remotely
identifying which implementation of an application or 
operating system a remote host is running. Fingerprint- 
ing tools [8, 11, 15] need fingerprints to operate and con- 
stantly require new fingerprints as new implementations, 
or new versions of existing implementations, become 
available. Thus, the process of automatically finding
these fingerprints, i.e., fingerprint generation, is crucial
for these tools.


Automatic deviation discovery is a challenging task:
deviations usually happen in corner cases, and discovering
deviations is often like finding needles in a haystack.
Previous work in related areas is largely insufficient. For
example, the most commonly used technique is random
or semi-random generation of inputs [20, 43] (also called
fuzz testing). In this approach, random inputs are
generated and sent to different implementations to observe
whether they trigger a difference in outputs. The obvious
drawback of this approach is that it may take many such
random inputs before finding a deviation.


In this paper, we propose a novel approach to automatically
discover deviations in input checking and processing
between different implementations of the same
protocol specification. We are given two programs $P_1$
and $P_2$ implementing the same protocol. At a high level,
we build two formulas, $f_1$ and $f_2$, which capture how
each program processes a single input. Then, we check
whether the formula $(f_1 \land \lnot f_2) \lor (\lnot f_1 \land f_2)$ is
satisfiable, using a solver such as a decision procedure. If the
formula is satisfiable, we can find an input
that satisfies $f_1$ but not $f_2$, or vice versa, in which
case it may lead the two program executions to semantically
different output states. Such inputs are good candidates
to trigger a deviation. We then send such candidate
inputs to the two programs and monitor their output
states. If the two programs end up in two semantically
different output states, then we have successfully found
a deviation between the two implementations, and the
corresponding input that triggers the deviation.


We have built a prototype implementation of our ap- 
proach. It handles both Windows and Linux binaries 
running on an x86 platform. We have evaluated our ap- 
proach using multiple implementations of two different 
protocols: HTTP and NTP. Our approach has success- 
fully identified deviations between servers and automat- 
ically generated inputs that triggered different server be- 
haviors. These deviations include errors and differences 
in the interpretation of the protocol specification. The 
evaluation shows that our approach is accurate: in one 
case, the relevant part of the generated input is only three 
bits. Our approach is also efficient: we found deviations 
using a single request in about one minute. 


Contributions. In summary, in this paper, we make the 
following contributions: 


• Automatic discovery of deviations: We propose
a novel approach to automatically discover devia- 
tions in the way different implementations of the 
same protocol specification check and process their 
input. Our approach has several advantages: (1) 
by automatically building symbolic formulas from 
an implementation, our approach is precisely faith- 
ful to the implementation; (2) by solving formulas 
created from two different implementations of the 
same specification, our approach significantly re- 
duces the number of inputs needed to find devia- 
tions; (3) our approach works on binaries directly, 
without access to the source code. This is important 
for wide applicability, since implementations may 
be proprietary and thus not have the source code 
available. In addition, the binary is what gets ex- 
ecuted, and thus it represents the true behavior of 
the program. 


• Error detection using deviation discovery: We
show how to apply our approach for automati- 
cally discovering deviations to the problem of error 
detection—the discovered deviations provide can- 
didate implementation errors. One fundamental ad- 
vantage of our approach is that it does not require 
a user to manually generate a model of the protocol 
specification, which is often complex, tedious, and 
error-prone to generate. 


• Fingerprint generation using deviation discovery:
We show how to apply our approach for
automatically discovering deviations to the prob- 
lem of fingerprint generation—the discovered devi- 
ations naturally give rise to fingerprints. Compared 
to previous approaches, our solution significantly 
reduces the number of candidate inputs that need 
to be tested to discover a fingerprint [20]. 


• Implementing the approach: We have built a prototype
that implements our approach. Our evaluation
tion shows that our approach is accurate and effi- 
cient. It can identify deviations with few example 
inputs at bit-level accuracy. 


The remainder of the paper is organized as fol- 
lows. Section 2 introduces the problem and presents 
an overview of our approach. In Section 3 we present 
the different phases and elements that comprise our ap- 
proach and in Section 4 we describe the details of our 
implementation. Then, in Section 5 we present the eval- 
uation results of our approach over different protocols. 
We discuss future enhancements to our approach in Sec- 
tion 6. Finally, we present the related work in Section 7 
and conclude in Section 8. 


2 Problem Statement and Approach Overview


In this section, we first describe the problem statement, 
then we present the intuition behind our approach, and 
finally we give an overview of our approach. 


Problem statement. In this paper we focus on the 
problem of automatically detecting deviations in proto- 
col implementations. In particular, we aim to find inputs 
that cause two different implementations of the same 
protocol specification to reach semantically different out- 
put states. When we find such an input, we say we have 
found a candidate deviation. 

The output states need to be externally observable. We 
use two methods to observe such states: (a) monitoring 
the network output of the program, and (b) supervising 
its environment, which allows us to detect unexpected 
states such as program halt, reboot, crash, or resource 
starvation. However, we cannot simply compare the 
complete output from both implementations, since the 
output may be different but semantically equivalent. For 
example, many protocols contain sequence numbers, and 
we would expect the output from two different imple- 
mentations to contain two different sequence numbers. 
However, the output messages may still be semantically 
equivalent. 

Therefore, we may use some domain knowledge about 
the specific protocol being analyzed to determine when 
two output states are semantically different. For example,
many protocols, such as HTTP, include a status code
in the response to provide feedback about the status of
the request. We use this information to determine whether two
output states are semantically equivalent or not. In other 
cases, we observe the effect of a particular query in the 
program, such as program crash or reboot. Clearly these 
cases are semantically different from a response being 
emitted by the program. 
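
To make the status-code comparison concrete, here is a minimal sketch of an equivalence check for HTTP replies. It assumes, for illustration only, that two replies are semantically equivalent exactly when their status codes match; the helper names and sample replies are hypothetical and do not come from the prototype.

```python
def http_status_code(response: bytes) -> int:
    """Extract the numeric status code from an HTTP response's status line."""
    status_line = response.split(b"\r\n", 1)[0]
    # Status line layout: HTTP-version SP status-code SP reason-phrase
    return int(status_line.split(b" ")[1])


def semantically_equivalent(resp_a: bytes, resp_b: bytes) -> bool:
    """Assumption: replies are equivalent when their status codes match.

    Headers such as Server or Date are deliberately ignored, since they
    differ between implementations without changing the outcome.
    """
    return http_status_code(resp_a) == http_status_code(resp_b)


# Hypothetical replies from two implementations.
reply_a = b"HTTP/1.1 200 OK\r\nServer: impl-a\r\n\r\n"
reply_b = b"HTTP/1.1 200 OK\r\nServer: impl-b\r\nDate: today\r\n\r\n"
reply_c = b"HTTP/1.1 400 Bad Request\r\n\r\n"

same = semantically_equivalent(reply_a, reply_b)       # differing headers only
different = semantically_equivalent(reply_a, reply_c)  # differing status codes
```

States such as a crash or a hang produce no reply at all, so a fuller check would need a separate representation for them.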


Intuition of our approach. We are given two implementations
$P_1$ and $P_2$ of the same protocol specification.
Each implementation at a high level can be viewed as
a mapping function from the protocol input space $I$ to
the protocol output state space $S$. Let $P_1, P_2 : I \to S$
represent the mapping functions of the two implementations.
Each implementation accepts inputs $x \in I$ (e.g.,
an HTTP request), and then processes the input, resulting
in a particular protocol output state $s \in S$ (e.g., an HTTP
reply). At a high level, we wish to find inputs such that
the same input, when sent to the two implementations,
will cause each implementation to result in a different
protocol output state.

Our goal is to find an input $x \in I$ such that $P_1(x) \neq
P_2(x)$. Finding such an input through random testing is
usually hard.


However, in general it is easy to find an input $x \in I$
such that $P_1(x) = P_2(x) = s \in S$, i.e., most inputs
will result in the same protocol output state $s$ for different
implementations of the same specification. Let $f(x)$
be the formula representing the set of inputs $x$ such that
$f(x) = \text{true} \iff P(x) = s$. When $P_1$ and $P_2$ implement
the same protocol differently, there may be some
input where $f_1$ will not be the same as $f_2$:


$$\exists x.\ (f_1(x) \land \lnot f_2(x)) \lor (\lnot f_1(x) \land f_2(x)) = \text{true}.$$


The intuition behind the above expression is that when
$f_1(x) \land \lnot f_2(x) = \text{true}$, then $P_1(x) = s$ (because
$f_1(x) = \text{true}$) while $P_2(x) \neq s$ (because $f_2(x) =
\text{false}$), thus the two implementations reach different output
states for the same input $x$. Similarly, $\lnot f_1(x) \land f_2(x)$
indicates when $P_1(x) \neq s$, but $P_2(x) = s$. We take the
disjunction since we only care whether the implementations
differ from each other.

Given the above intuition, the central idea is to create
the formula $f$ using the technique of weakest precondition
[19, 26]. Let $Q$ be a predicate over the state space
of a program. The weakest precondition $wp(P, Q)$ for
a program $P$ and post-condition $Q$ is a boolean formula
$f$ over the input space of the program. In our setting, if
$f(x) = \text{true}$, then $P(x)$ will terminate in a state satisfying
$Q$, and if $f(x) = \text{false}$, then $P(x)$ will not terminate
in a state satisfying $Q$ (it either "goes wrong" or does not
terminate). For example, if the post-condition $Q$ is that
$P$ outputs a successful HTTP reply, then $f = wp(P, Q)$
characterizes all inputs which lead $P$ to output a successful
HTTP reply. The boolean formula output by the
weakest precondition is our formula $f$.

Furthermore, we observe that the above method can
still be used even if we do not consider the entire program
and only consider a single execution path (we discuss
multiple execution paths in Section 6). In that case,
the formula $f$ represents the subset of protocol inputs
that follow one of the execution paths considered and still
reach the protocol output state $s$. Thus, $f(x) = \text{true} \Rightarrow
P(x) = s$: if an input satisfies $f$, it is guaranteed to
make program $P$ go to state $s$, but the converse is
not necessarily true, since an input that makes $P$ go to state
$s$ may not satisfy $f$. In our problem, this means that the
difference between $f_1$ and $f_2$ may not necessarily result
in a true deviation, as shown in Figure 2. Instead, the
difference between $f_1$ and $f_2$ is a good candidate, which
we can then test to validate whether it is a true deviation.
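
The one-directional implication can be seen on a toy example. In the sketch below, `program` is a hypothetical implementation in which two different paths reach the same output state, and `path_formula` covers only one of them; neither is taken from the systems evaluated in the paper.

```python
def program(x: int) -> str:
    """Toy implementation: two different paths both reach state 'ok'."""
    if x < 10:        # path A
        return "ok"
    if x % 2 == 0:    # path B
        return "ok"
    return "error"


def path_formula(x: int) -> bool:
    """Formula f for path A only, e.g. the path taken by an original input x = 3."""
    return x < 10


inputs = range(0, 32)

# f(x) = true implies P(x) = s: every input satisfying f reaches 'ok'.
sound = all(program(x) == "ok" for x in inputs if path_formula(x))

# The converse fails: these inputs reach 'ok' without satisfying f.
missed = [x for x in inputs if program(x) == "ok" and not path_formula(x)]
```

Inputs in `missed` are the reason a difference between two single-path formulas is only a candidate: an input may falsify the formula and still reach the same output state through another path.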


Overview of our approach. Our approach is an iterative
process, and each iteration consists of three phases,
as shown in Figure 1. First, in the formula extraction
phase, we are given two binaries $P_1$ and $P_2$ implementing
the same protocol specification, such as HTTP, and
an input $x$, such as an HTTP GET request. For each
implementation, we log an execution trace of the binary
as it processes the input, and record what output state it
reaches, such as halting or sending a reply. We assume
that the execution from both binaries reaches semantically
equivalent output states; otherwise we have already
found a deviation! For each implementation, $P_1$ and
$P_2$, we then use this information to produce a boolean
formula over the input, $f_1$ and $f_2$ respectively, each of
which is satisfied for inputs that cause the binary to reach
the same output state as the original input did.


Figure 1: Overview of our approach.


Next, in the deviation detection phase, we use a solver
(such as a decision procedure) to find differences in the
two formulas $f_1$ and $f_2$. In particular, we ask the solver
if $(f_1 \land \lnot f_2) \lor (f_2 \land \lnot f_1)$ is satisfiable. When the
formula is satisfiable, the solver returns an example satisfying input. We
call these inputs the candidate deviation inputs.
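
The satisfiability query can be sketched on a toy input space. The example below replaces the decision procedure with exhaustive enumeration over a single input byte, which is only feasible because the space is tiny; the two formulas are hypothetical stand-ins for formulas extracted from two implementations.

```python
from typing import Callable, Iterable, List


def f1(b: int) -> bool:
    """Hypothetical formula for implementation 1: byte is an ASCII digit."""
    return 0x30 <= b <= 0x39


def f2(b: int) -> bool:
    """Hypothetical formula for implementation 2: off-by-one upper bound."""
    return 0x30 <= b <= 0x3A


def candidate_deviations(fa: Callable[[int], bool],
                         fb: Callable[[int], bool],
                         input_space: Iterable[int]) -> List[int]:
    """Return inputs satisfying (fa and not fb) or (fb and not fa)."""
    return [x for x in input_space
            if (fa(x) and not fb(x)) or (fb(x) and not fa(x))]


candidates = candidate_deviations(f1, f2, range(256))
```

Here the only candidate is the byte 0x3A, which satisfies the second formula but not the first. A solver answers the same question symbolically, without enumerating inputs, which is what makes the approach practical for realistic input sizes.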


Finally, in the validation phase we evaluate the candidate
deviation inputs obtained in the deviation detection
phase on both implementations and check whether
the implementations do in fact reach semantically different
output states. This phase is necessary because the
symbolic formula might not include all possible execution
paths: an input that satisfies $f_1$ is guaranteed to
make $P_1$ reach the same semantically equivalent output
state as the original input $x$, but an input that does not
satisfy $f_1$ may also make $P_1$ reach a semantically equivalent
output state. Hence, the generated candidate deviation
inputs may actually still cause both implementations
to reach semantically equivalent output states.

If the implementations do reach semantically different
output states, then we have found a deviation triggered by
that input. This deviation is useful for two things: (1) it
may represent an implementation error in at least one of
the implementations, which can then be checked against
the protocol specification to verify whether it is truly an
error; (2) it can be used as a fingerprint to distinguish
between the two implementations.


Iteration. We can iterate this entire process to examine 
other input types. Continuing with the HTTP example, 
we can compare how the two implementations process 
other types of HTTP requests, such as HEAD and POST, 
by repeating the process on those types of requests. 


3 Design 


In this section, we describe the details of the three phases 
in our approach, the formula extraction phase, the devia- 
tion detection phase, and the validation phase. 


3.1 Formula Extraction Phase 
3.1.1 Intuition and Overview 


The goal of the formula extraction phase is as follows: given an
input $x$ such that $P_1(x) = P_2(x) = s$, where $s$ is the
output state when executing input $x$ with the two given
programs, we would like to compute two formulas, $f_1$
and $f_2$, such that


$$f_1(x) = \text{true} \Rightarrow P_1(x) = s$$


and

$$f_2(x) = \text{true} \Rightarrow P_2(x) = s.$$


This matches well with the technique of weakest precondition
(WP) [19, 26]. The weakest precondition, denoted
$wp(P, Q)$, is a boolean formula $f$ over the input space $I$
of $P$ such that if $f(x) = \text{true}$, then $P(x)$ will terminate
in a state satisfying $Q$. In our setting, the post-condition
is the protocol output state, and the weakest precondition
is a formula characterizing the protocol inputs that
cause the implementation to reach the specified protocol
output state.

Unfortunately, calculating the weakest precondition 
over an entire real-world binary program can easily re- 
sult in a formula that is too big to solve. First, there may 
be many program paths which can lead to a particular 
output state. We show that we can generate interesting 
deviations even when considering a single program path. 
Second, we observe that in many cases only a small sub- 
set of instructions operate on data derived from the origi- 
nal input. There is no need to model the instructions that 
do not operate on data derived from the original input, 
since the result they compute will be the same as in the 
original execution. Therefore we eliminate these instruc- 
tions from the WP calculation, and replace them with 
only a series of assignments of concrete values to the rel- 
evant program state just before an instruction operates on 
data derived from the input. 

Hence, in our design, we build the symbolic formula 
in two distinct steps. We first execute the program on the 
original input, while recording a trace of the execution. 
We then use this execution trace to build the symbolic 
formula. 


3.1.2 Calculating the Symbolic Formula 


In order to generate the symbolic formula, we perform 
the following steps: 


1. Record the execution trace of the executed program
path.

2. Process the execution trace. This step translates the
execution trace into a program $B$ written in our simplified
intermediate representation (IR).

3. Generate the appropriate post-condition $Q$.

4. Calculate the weakest precondition on $B$ by:


(a) Translating $B$ into a single assignment form.


(b) Translating the (single assignment) IR program
into the guarded command language
(GCL). The GCL program, denoted $B_g$, is
semantically equivalent to the input IR statements,
but appropriate for the weakest precondition
calculation.


(c) Computing the weakest precondition $f =
wp(B_g, Q)$ in a syntax-directed fashion on the
GCL.


The output of this phase is the symbolic formula f. 
Below we describe these steps in more detail. 


Step 1: Recording the execution trace. We generate 
formulas based upon the program path for a single ex- 
ecution. We have implemented a path recorder which 
records the execution trace of the program. The exe- 
cution trace is the sequence of machine instructions ex- 
ecuted, and for each executed instruction, the value of 
each operand, whether each operand is derived from the 
input, and if it is derived from the input, an identifier for 
the original input stream it comes from. The trace also 
has information about the first use of each input byte, 
identified by its offset in the input stream. For example, 
for data derived from network inputs, the identifier spec- 
ifies which session the input came from, and the offset 
specifies the original position in the session data. 
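
One possible in-memory layout for such a trace is sketched below. The class and field names (`Operand`, `TraceEntry`, `stream_id`, and so on) are assumptions made for illustration; the paper does not specify the recorder's actual data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Operand:
    name: str                         # register or memory location
    value: int                        # concrete value seen during execution
    tainted: bool                     # True if derived from the input
    stream_id: Optional[int] = None   # which input stream (e.g., session)
    offset: Optional[int] = None      # byte offset within that stream


@dataclass
class TraceEntry:
    address: int
    instruction: str
    operands: List[Operand]


def first_uses(trace: List[TraceEntry]) -> Dict[Tuple[int, int], int]:
    """Map each (stream_id, offset) to the index of the first entry using it."""
    first: Dict[Tuple[int, int], int] = {}
    for index, entry in enumerate(trace):
        for op in entry.operands:
            if op.tainted:
                first.setdefault((op.stream_id, op.offset), index)
    return first


# Hypothetical three-instruction trace reading two input bytes.
trace = [
    TraceEntry(0x401000, "mov al, [esi]", [Operand("al", 0x47, True, 1, 0)]),
    TraceEntry(0x401002, "cmp al, 0x47",
               [Operand("al", 0x47, True, 1, 0), Operand("imm", 0x47, False)]),
    TraceEntry(0x401004, "mov bl, [esi+1]", [Operand("bl", 0x45, True, 1, 1)]),
]
uses = first_uses(trace)
```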


Step 2: Processing the execution trace. We process 
the execution trace to include only relevant instructions. 
An instruction is relevant if it operates on data derived 
from the input. For each relevant instruction, we:


• Translate the x86 instruction to an easier-to-analyze
intermediate representation (IR). The generated IR 
is semantically equivalent to the original instruc- 
tion. 


The advantage of our IR is that it allows us to per- 
form subsequent steps over the simpler IR state- 
ments, instead of the hundreds of x86 instructions. 
The translation from an x86 instruction to our IR 
is designed to correctly model the semantics of the 
original x86 instruction, including making other- 
wise implicit side effects explicit. For example, we 
insert code to correctly model instructions that set 
the eflags register, single instruction loops (e.g.,
rep instructions), and instructions that behave dif- 
ferently depending on the operands (e.g., shifts). 


Our IR is shown in Table 1. We translate each x86
instruction into this IR. Our IR has assignments ($r := v$),
binary and unary operations ($r := r_1 \,\Box_b\, v$ and
$r := \Box_u\, v$, where $\Box_b$ and $\Box_u$ are binary and unary
operators), loading a value from memory into a register
($r_1 := *(r_2)$), storing a value ($*(r_1) := r_2$),
direct jumps (jmp $\ell$) to a known target label (label
$\ell_i$), indirect jumps to a computed value stored in a
register (ijmp $r$), and conditional jumps (if $r$ then
jmp $\ell_1$ else jmp $\ell_2$).














































• Translate the information logged about the operands
into a sequence of initialization statements. For 
each operand: 


— If it is not derived from input, the operand is 
assigned the concrete value logged in the ex- 
ecution trace. These assignments effectively 
model the sequences of instructions that we do 
not explicitly include. 


Instructions   i   ::=  *(r1) := r2 | r1 := *(r2) | r := v | r := r1 □b v
                        | r := □u v | label ℓi | jmp ℓ | ijmp r
                        | if r jmp ℓ1 else jmp ℓ2
Operations     □b  ::=  +, -, *, /, <<, >>, &, |, ⊕, ==, !=, <, ≤   (binary operations)
               □u  ::=  ¬, !   (unary operations)
Operands       v   ::=  n (an integer literal) | r (a register) | ℓ (a label)
Reg. Types     τ   ::=  reg64_t | reg32_t | reg16_t | reg8_t | reg1_t   (number of bits)


Table 1: Our RISC-like assembly IR. We convert x86 assembly instructions into this IR.


— For operands derived from input, the first time 
we encounter a byte derived from a particu- 
lar input identifier and offset, we initialize the 
corresponding byte of the operand with a sym- 
bolic value that uniquely identifies that input 
identifier and offset. On subsequent instruc- 
tions that operate on data derived from that 
particular input identifier and offset, we do 
not initialize the corresponding operand, since 
we want to accurately model the sequence of 
computations on the input. 


The output of this step is an IR program B consisting 
of a sequence of IR statements. 
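
The two initialization rules can be sketched as follows. Operands are modeled as hypothetical `(name, concrete_value, source)` tuples, where `source` is `None` for data not derived from the input and an `(input_id, offset)` pair otherwise; the statement syntax and the `INPUT_<id>_<offset>` naming are illustrative assumptions, not the prototype's format.

```python
from typing import List, Optional, Set, Tuple

Source = Optional[Tuple[int, int]]


def init_statements(operands: List[Tuple[str, int, Source]],
                    seen: Set[Tuple[int, int]]) -> List[str]:
    """Emit initialization statements for the operands of one instruction."""
    stmts = []
    for name, value, source in operands:
        if source is None:
            # Not derived from input: pin the operand to its logged value.
            stmts.append(f"{name} := {value}")
        elif source not in seen:
            # First use of this input byte: introduce a symbolic value.
            seen.add(source)
            input_id, offset = source
            stmts.append(f"{name} := INPUT_{input_id}_{offset}")
        # Later uses of the same input byte get no initialization, so the
        # computations already applied to it are preserved.
    return stmts


seen: Set[Tuple[int, int]] = set()
first = init_statements([("eax", 7, None), ("bl", 0x47, (1, 0))], seen)
second = init_statements([("bl", 0x47, (1, 0)), ("ecx", 2, None)], seen)
```

The `seen` set is shared across instructions so that the "first time we encounter a byte" rule applies to the whole trace rather than to a single instruction.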


Step 3: Setting the post-condition. Once we have 
generated the IR program from the execution trace, the 
next step is to select a post-condition, and compute the 
weakest precondition of this post-condition over the pro- 
gram, yielding our symbolic formula. 

The post-condition specifies the desired protocol out- 
put state, such as what kind of response to a request 
message is desired. In our current setting, an ideal post- 
condition would specify that “The input results in an ex- 
ecution that results in an output state that is semantically 
equivalent to the output state reached when processing 
the original input.” That is, we want our formula to be 
true for exactly the inputs that are considered “seman- 
tically equivalent” to the original input by the modeled 
program binary. 

In our approach, the post-condition specifies that the output
state should be the same as in the trace. In order to
make the overall formula size reasonable, we add additional
constraints to the post-condition that constrain
the formula to the same program path as the one taken in the
trace. We do this by iterating over all conditional jumps
and indirect jumps in the IR, and for each jump, adding a
clause to the post-condition that ensures that the final formula
only considers inputs that also result in the same
destination for the given jump. For example, if in the
trace "if $e$ then $\ell_1$ else $\ell_2$" was evaluated and the
next instruction executed was $\ell_2$, then $e$ must have evaluated
to false, and we add a clause restricting $e = \text{false}$
to the post-condition.
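
A minimal sketch of this clause generation for conditional jumps is shown below. Each jump is represented as a hypothetical `(condition, true_label, false_label, taken_label)` tuple and clauses are built as plain strings; both choices are assumptions for illustration, and indirect jumps are left out for brevity.

```python
from typing import List, Tuple

Jump = Tuple[str, str, str, str]  # (condition, true_label, false_label, taken)


def path_clauses(jumps: List[Jump]) -> List[str]:
    """One clause per conditional jump, fixing it to the traced destination."""
    clauses = []
    for cond, true_label, false_label, taken in jumps:
        if taken == true_label:
            clauses.append(f"({cond}) == true")
        elif taken == false_label:
            clauses.append(f"({cond}) == false")
        else:
            raise ValueError(f"traced target {taken} matches neither branch")
    return clauses


def extend_post_condition(q: str, jumps: List[Jump]) -> str:
    """Conjoin the output-state condition with the path clauses."""
    return " and ".join([q] + path_clauses(jumps))


# Hypothetical trace: the first jump took its true branch, the second its false one.
jumps = [
    ("input0 == 0x47", "L1", "L2", "L1"),
    ("input1 < 0x20", "L3", "L4", "L4"),
]
post = extend_post_condition("output_state == traced_state", jumps)
```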


In some programs, there may be multiple paths that 
reach the same output state. Our techniques can be gen- 
eralized to handle this case, as discussed in Section 6. In 
practice, we have found this post-condition to be suffi- 
cient for finding interesting deviations. Typically, inputs 
that cause the same execution path to be followed are 
treated equivalently by the program, and result in equiv- 
alent output states. Conversely, inputs that follow a dif- 
ferent execution path often result in a semantically dif- 
ferent output state of the program. Although more com- 
plicated and general post-conditions are possible, one in- 
teresting result from our experiments is that the simple 
approach was all that was needed to generate interesting 
deviations. 


Step 4: Calculating the weakest precondition. The
weakest precondition (WP) calculation step takes as input
the IR program $B$ from Step 2, and the desired post-condition
$Q$ from Step 3. The weakest precondition, denoted
$wp(B, Q)$, is a boolean formula $f$ over the input
space such that if $f(x) = \text{true}$, then $B(x)$ will terminate
in a state satisfying $Q$. For example, if the program is
$B : y = x + 1$ and $Q : 2 < y < 5$, then $wp(B, Q)$ is
$1 < x < 4$.
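
The example above can be checked mechanically. The sketch below models program states as dictionaries and uses the standard rule that the weakest precondition of an assignment is the post-condition evaluated on the updated state; the function names are illustrative and the integer range is an arbitrary test window.

```python
from typing import Callable, Dict

State = Dict[str, int]


def wp_assign(update: Callable[[State], State],
              post: Callable[[State], bool]) -> Callable[[State], bool]:
    """wp(v := e, Q) is Q with e substituted for v, i.e. Q on the updated state."""
    return lambda state: post(update(state))


def B(state: State) -> State:
    """The program y = x + 1."""
    return {**state, "y": state["x"] + 1}


def Q(state: State) -> bool:
    """The post-condition 2 < y < 5."""
    return 2 < state["y"] < 5


f = wp_assign(B, Q)

# Over the integers, 1 < x < 4 leaves exactly x = 2 and x = 3.
satisfying = [x for x in range(-10, 11) if f({"x": x})]
```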

We describe the steps for computing the weakest pre- 
condition below. 

Step 4a: Translating into single assignment form. We
translate the IR program B from the previous step into
a form in which every variable is assigned at most once.
(The transformed program is semantically equivalent to
the input IR.) We perform this step to enable additional
optimizations described in [19, 29, 36], which further
reduce the formula size. For example, this transformation
will rewrite the program x := x + 1; x := x + 1 as
x1 := x0 + 1; x2 := x1 + 1. We carry out this
transformation by maintaining a mapping from the variable
name to its current incarnation, e.g., the original
variable x may have incarnations x0, x1, and x2. We
iterate through the program and replace each variable use
with its current incarnation. This step is similar to
computing the SSA form of a program [39], and is a widely
used technique.
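The renaming pass can be sketched in a few lines of Python. This is only an illustration for straight-line code: the statement format (pairs of a left-hand-side name and a right-hand-side string), the function name to_single_assignment, and the restriction to purely alphabetic variable names are assumptions of the sketch, not details of the IR.

```python
import re
from typing import Dict, List, Tuple


def to_single_assignment(stmts: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Rename variables so that each one is assigned at most once.

    stmts is a list of (lhs, rhs) pairs; variable names are assumed to
    consist of letters and underscores only.
    """
    current: Dict[str, int] = {}  # variable name -> current incarnation

    def use(match) -> str:
        name = match.group(0)
        # A variable read before any assignment is incarnation 0.
        return f"{name}{current.setdefault(name, 0)}"

    out = []
    for lhs, rhs in stmts:
        new_rhs = re.sub(r"[A-Za-z_]+", use, rhs)  # rewrite uses first
        current[lhs] = current.get(lhs, -1) + 1     # then a fresh incarnation
        out.append((f"{lhs}{current[lhs]}", new_rhs))
    return out


ssa = to_single_assignment([("x", "x + 1"), ("x", "x + 1")])
```

Incrementing a variable twice yields two distinct incarnations on the left-hand sides, each defined in terms of the previous one.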

Step 4b: Translating to GCL. The translation to GCL
takes as input the single assignment form from Step 4a
and outputs a semantically equivalent GCL program B_g.
We perform this step since the weakest precondition is
calculated over the GCL language [26], in a
syntax-directed manner over B_g.

The GCL language constructs we use are shown in
Table 2. Although GCL may look unimpressive, it is
sufficiently expressive for reasoning about complex
programs [24, 26, 28, 29].¹ Statements S in our GCL
programs mirror statements in assembly, e.g., store,
load, assign, etc. GCL has assignments of the form
lhs := e, where lhs is a register or memory location and
e is a (side-effect free) expression. assume e assumes
a particular (side-effect free) expression is true. An
assume statement is used to reason about conditional jump
predicates, i.e., we add “assume e” for the true branch of
a conditional jump and “assume ¬e” for the false branch.
assert e asserts that e must be true for execution to
continue, else the program fails. In other words, Q cannot
be satisfied if assert e is false. skip is a semantic
no-op. S_1; S_2 denotes a sequence where statement S_1 is
executed first and then statement S_2 is executed.
S_1 □ S_2 is called a choice statement, and indicates that
either S_1 or S_2 may be executed. Choice statements are
used for if-then-else constructs.

For example, the IR:

    if (x0 < 0) {
        x1 := x0 - 1;
    } else {
        x1 := x0 + 1;
    }

will be translated as:

    (assume x0 < 0; x1 := x0 - 1;)
    □
    (assume ¬(x0 < 0); x1 := x0 + 1;)

The above allows calculating the WP over multiple
paths (we discuss multiple paths in Section 6). In our
setting, we only consider a single path. For each branch
condition e evaluated in the trace, we could add the GCL
statement assert e if e evaluated to true (else assert ¬e
if e evaluated to false). In our implementation, using
assert in this manner is equivalent to adding a clause for
each branch predicate to the post-condition (e.g., making
the post-condition e ∧ Q when e evaluated to true in the
trace).

Step 4c: Computing the weakest precondition. We
compute the weakest precondition for B_g from the
previous step in a syntax-directed manner. The rules for
computing the weakest precondition are shown in Table 2.
Most rules are straightforward, e.g., to calculate the
weakest precondition wp(A; B, Q), we calculate
wp(A, wp(B, Q)). Similarly, wp(assume e, Q) = e ⇒ Q.
For assignments lhs := e, we generate a let expression
which binds the variable name lhs to the expression e.
We also take advantage of a technical transformation,
which can further reduce the size of the formula by using
the single assignment form from Step 4a [19, 29, 36].

¹The GCL defines a few additional commands such as a
do-while loop, which we do not need.
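The syntax-directed rules of Table 2 translate almost directly into a recursive function. The Python sketch below is illustrative only: it represents predicates and expressions as Python callables over a state dictionary instead of building a symbolic formula, and every class and function name in it is an assumption of the sketch. The rule for skip (wp(skip, Q) = Q) is the standard one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

State = Dict[str, int]
Expr = Callable[[State], int]
Pred = Callable[[State], bool]


@dataclass
class Assign:
    lhs: str
    e: Expr


@dataclass
class Assume:
    e: Pred


@dataclass
class Assert:
    e: Pred


@dataclass
class Skip:
    pass


@dataclass
class Seq:
    a: "Stmt"
    b: "Stmt"


@dataclass
class Choice:
    a: "Stmt"
    b: "Stmt"


Stmt = Union[Assign, Assume, Assert, Skip, Seq, Choice]


def wp(stmt: Stmt, q: Pred) -> Pred:
    """Weakest precondition of stmt with respect to post-condition q."""
    if isinstance(stmt, Assign):
        # "let lhs = e in Q": evaluate Q with lhs bound to the value of e.
        return lambda s: q({**s, stmt.lhs: stmt.e(s)})
    if isinstance(stmt, Assume):
        return lambda s: (not stmt.e(s)) or q(s)   # e => Q
    if isinstance(stmt, Assert):
        return lambda s: stmt.e(s) and q(s)        # e and Q
    if isinstance(stmt, Skip):
        return q
    if isinstance(stmt, Seq):
        return wp(stmt.a, wp(stmt.b, q))
    if isinstance(stmt, Choice):
        pa, pb = wp(stmt.a, q), wp(stmt.b, q)
        return lambda s: pa(s) and pb(s)
    raise TypeError(f"unknown statement: {stmt!r}")


# The if-then-else example as a choice between two guarded branches.
prog = Choice(
    Seq(Assume(lambda s: s["x0"] < 0), Assign("x1", lambda s: s["x0"] - 1)),
    Seq(Assume(lambda s: not (s["x0"] < 0)), Assign("x1", lambda s: s["x0"] + 1)),
)
pre = wp(prog, lambda s: s["x1"] > 0)
```

For the post-condition x1 > 0, the resulting precondition holds exactly when x0 >= 0: the negative branch can never produce a positive x1, while the non-negative branch always does.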


3.1.3 Memory Reads and Writes to Symbolic Addresses


If the instruction accesses memory using an address that 
is derived from the input, then in the formula the address 
will be symbolic, and we must choose what set of possi- 
ble addresses to consider. In order to remain sound, we 
add a clause to our post-condition to only consider execu- 
tions that would calculate an address within the selected 
set. Considering more possible addresses increases the 
generality of our approach, at the cost of more analysis. 


Memory reads. When reading from a memory loca- 
tion selected by an address derived from the input, we 
must process the memory locations in the set of ad- 
dresses being considered as operands, generating any ap- 
propriate initialization statements, as above. 


We achieve good results considering only the address 
that was actually used in the logged execution trace 
and adding the corresponding constraints to the post- 
condition to preserve soundness. In practice, if useful 
deviations are not found from the corresponding formula, 
we could consider a larger range of addresses, achieving 
a more descriptive formula at the cost of performance. 
We have implemented an analysis that bounds the range 
of symbolic memory addresses [2], but have found we 
get good results without performing this additional step.


Memory writes. We need not transform writes to 
memory locations selected by an address derived from 
the input. Instead we record the selected set of addresses 
to consider, and add the corresponding clause to the post- 
condition to preserve soundness. These conditions force 
the solver to reason about any potential alias relation- 
ships. As part of the weakest precondition calculation, 
subsequent memory reads that could use one of the ad- 
dresses being considered are transformed to a conditional 
statement handling these potential aliasing relationships. 
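One way to picture the conditional produced for a later read is as a chain of if-then-else terms over the earlier writes. The Python sketch below builds such a term as a string; the ite(...) and mem[...] notation and the function name read_expr are assumptions made for illustration and do not reflect the formula syntax actually emitted.

```python
from typing import List, Tuple


def read_expr(addr: str, writes: List[Tuple[str, str]]) -> str:
    """Build a conditional term for a read at a symbolic address.

    writes lists earlier (address, value) pairs, oldest first, so the
    most recent write ends up as the outermost test.
    """
    expr = f"mem[{addr}]"  # value if no earlier write aliases the read
    for w_addr, w_val in writes:
        expr = f"ite({addr} == {w_addr}, {w_val}, {expr})"
    return expr


term = read_expr("a", [("a1", "v1"), ("a2", "v2")])
```

The solver then has to decide which of the address equalities hold, which is how the potential alias relationships enter the formula.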


As with memory reads, we achieve good results only 
considering the address that was actually used in the 
logged execution trace. Again, we could generalize the 
formula to consider more values, by selecting a range of 
addresses to consider.

    A, B ∈ GCL stmt ::= lhs := e
                      | A; B
                      | assume e    (e is an expression)
                      | assert e    (e is an expression)
                      | A □ B
                      | skip

    GCL stmt      wp(stmt, Q)
    assume e      e ⇒ Q
    assert e      e ∧ Q
    lhs := e      let lhs = e in Q
    A; B          wp(A, wp(B, Q))
    A □ B         wp(A, Q) ∧ wp(B, Q)

Table 2: The guarded command language (left), along with
the corresponding weakest precondition predicate
transformer (right).

Figure 2: Different execution paths could end up in the
same output states. The validation phase checks whether
the new execution path explored by the candidate
deviation input obtained in the deviation detection phase
truly ends up in a different state.


3.2 Deviation Detection Phase 


In this phase, we use a solver to find candidate inputs
which may cause deviations. This phase takes as input
the formulas f_1 and f_2 generated for the programs P_1
and P_2 in the formula extraction phase. We rewrite the
variables in each formula so that they refer to the same
input, but each to their own internal states.

We then query the solver whether the combined formula
(f_1 ∧ ¬f_2) ∨ (¬f_1 ∧ f_2) is satisfiable, and if so,
to provide an example that satisfies the combined
formula. If the solver returns an example, then we have
found an input that satisfies one program’s formula, but
not the other. If we had perfectly and fully modeled each
program, and perfectly specified the post-condition to be
that “the input results in a semantically equivalent
output state”, then this input would be guaranteed to
produce a semantically equivalent output state in one
program, but not the other. Since we only consider one
program path and do not perfectly specify the
post-condition in this way, this input is only a candidate
deviation input.
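The combined query can be illustrated without a solver by enumerating a tiny input space. The Python sketch below is only a stand-in for the decision procedure: the two formulas are hypothetical predicates (one server requires a leading slash, the other accepts any single byte), and exhaustive enumeration replaces the satisfiability check.

```python
from itertools import product
from typing import Callable, Iterator

Formula = Callable[[bytes], bool]


def candidate_deviations(f1: Formula, f2: Formula,
                         length: int, alphabet: bytes) -> Iterator[bytes]:
    """Yield inputs satisfying (f1 and not f2) or (not f1 and f2)."""
    for combo in product(alphabet, repeat=length):
        x = bytes(combo)
        if (f1(x) and not f2(x)) or (not f1(x) and f2(x)):
            yield x


# Hypothetical formulas for two implementations.
f_strict = lambda x: x[:1] == b"/"   # insists on a leading slash
f_lenient = lambda x: len(x) == 1    # accepts any single byte

cands = list(candidate_deviations(f_strict, f_lenient, 1, b"/a"))
```

Each input produced this way is still only a candidate: it has to be replayed against the real implementations before it counts as a deviation.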


3.3 Validation Phase 


Finally, we validate the generated candidate deviation in- 
puts to determine whether they actually result in seman- 
tically different output states in the two implementations. 
As illustrated in Figure 2, it is possible that while an in- 
put does not satisfy the symbolic formula generated for 
a server, it actually does result in an identical or seman- 
tically equivalent output state. 

We send each candidate deviation input to the imple- 
mentations being examined, and compare their outputs to 
determine whether they result in semantically equivalent 
or semantically different output states. 

In theory, this testing requires some domain knowl- 
edge about the protocol implemented by the binaries, to 
determine whether their outputs are semantically equiva- 
lent. In practice, we have found deviations that are quite 
obvious. Typically, the server whose symbolic formula 
is satisfied by the input produces a response similar to 
its response to the original input, and the server whose 
symbolic formula is not satisfied by the input produces 
an error message, drops the connection, etc. 


4 Implementation 


Our implementation consists of several components: a 
path recorder, the symbolic formula generator, the solver, 
and a validator. We describe each below. 


Collecting the trace. The symbolic formula generator
component is based on QEMU, a complete system emulator
[10]. We use a modified version of QEMU that has been
enhanced with the ability to track how specified external
inputs, such as keyboard or received network data, are
processed. The formula generator monitors the execution
of a binary and records the execution trace, containing
all instructions executed by the program and the
information of their operands, such as their value and
whether they are derived from specified external inputs.
We start monitoring the execution before sending requests
to the server and stop the trace when we observe a
response from the server. We use a no-response timer
to stop the trace if no answer is observed from the server
after a configurable amount of time.


Symbolic formula generation. We implemented our 
symbolic formula generator as part of our BitBlaze bi- 
nary analysis platform [1]. The BitBlaze platform can 
parse executables and instruction traces, disassemble 
each instruction, and translate the instructions into the IR 
shown in Table 1. The entire platform consists of about 
16,000 lines of C/C++ code and 28,000 lines of OCaml, 
with about 1,600 lines of OCaml specifically written for 
our approach. 


Solver. We use STP [30,31] as our solver. It is a deci- 
sion procedure specialized in modeling bit-vectors. After 
taking our symbolic formula as input, it either outputs an 
input that can satisfy the formula, or decides that the for- 
mula is not satisfiable. 


Candidate deviation input validation. Once a candi- 
date deviation input has been returned by the solver, we 
need to validate it against both server implementations 
and monitor the output states. For this we have built 
small HTTP and NTP clients that read the inputs, send 
them over the network to the servers, and capture the re- 
sponses, if any. 

After sending candidate inputs to both implementa- 
tions, we determine the output state by looking at the 
response sent from the server. For those protocols that 
contain some type of status code in the response, such as 
HTTP in the Status-Line, each different value of the sta- 
tus code represents a different output state for the server. 
For those protocols that do not contain a status code in 
the response, such as NTP, we define a generic valid state 
and consider the server to have reached that state, as a 
consequence of an input, if it sends any well-formed re- 
sponse to the input, independently of the values of the 
fields in the response. 

In addition, we define three special output states: a
fatal state that includes any behavior that is likely to
cause the server to stop processing future queries, such
as a crash, reboot, halt, or resource starvation; a
no-response state that indicates that the server is not in
the fatal state but still did not respond before a
configurable timer expired; and a malformed state that
includes any response from the server that is missing
mandatory fields. This last state is needed because
servers might send messages back to the client that do not
follow the guidelines in the corresponding specification.
For example, several HTTP servers, such as Apache or
Savant, might respond to an incorrect request with a raw
message written into the socket, such as the string
“IOError”, without including the expected HTTP
Status-Line such as “HTTP/1.1 400 Bad Request”.

    Server     Version       Type          Binary size
    Apache     [illegible]   HTTP server   4,344kB
    Miniweb    [illegible]   HTTP server   528kB
    Savant     [illegible]   HTTP server   280kB
    NetTime    2.0 a7        NTP server    3,702kB
    Ntpd       4.1.72        NTP server    192kB

Table 3: Different server implementations used in our
evaluation.
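The output-state classification for HTTP described in this section can be sketched as follows. This is a simplified illustration, not the validator itself: the function name, the state labels, the server_alive flag, and the regular expression used to recognize a Status-Line are all assumptions of the sketch.

```python
import re
from typing import Optional

# Simplified Status-Line recognizer: "HTTP/<d>.<d> <3-digit code> ...".
STATUS_LINE = re.compile(rb"^HTTP/\d\.\d (\d{3}) ")


def http_output_state(response: Optional[bytes],
                      server_alive: bool = True) -> str:
    """Map an observed server reaction to an output state label."""
    if not server_alive:
        return "fatal"            # crash, reboot, halt, starvation, ...
    if response is None:
        return "no-response"      # timer expired without an answer
    match = STATUS_LINE.match(response)
    if match is None:
        return "malformed"        # e.g. a raw string with no Status-Line
    # Each distinct status code is a distinct output state.
    return "status-" + match.group(1).decode("ascii")
```

Two implementations are then considered to deviate on an input when the labels computed from their reactions differ.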


5 Evaluation 


We have evaluated our approach on two different proto- 
cols: HTTP and NTP. We selected these two protocols 
as representatives of two large families of protocols: text 
protocols (e.g. HTTP) and binary protocols (e.g. NTP). 
Text and binary protocols present significant differences 
in encoding, field ordering, and methods used to separate 
fields. Thus, it is valuable to study both families. In par- 
ticular, we use three HTTP server implementations and 
two NTP server implementations, as shown in Table 3. 
All the implementations are Windows binaries and the 
evaluation is performed on a Linux host running Fedora 
Core 5. 

The original inputs, which we need to send to the 
servers during the formula extraction phase to generate 
the execution traces, were obtained by capturing a net- 
work trace from one of our workstations and selecting 
all the HTTP and NTP requests that it contained. For 
each HTTP request in the trace, we send it to each of the 
HTTP servers and monitor its execution, generating an 
execution trace as output. We proceed similarly for each 
NTP request, obtaining an execution trace for each re- 
quest/server pair. In Section 5.1, we show the deviations 
we discovered in the web servers, and in Section 5.2, the 
deviations we discovered in the NTP servers. 


5.1 Deviations in Web Servers 


This section shows the deviations we found among three
web server implementations: Apache, Miniweb, and Savant.
For brevity and clarity, we only show results for a
specific HTTP query, which we find to be especially
important because it discovered deviations between
different server pairs. Figure 3 shows this query, which
is an HTTP GET request for the file /index.html.


Deviations detected. For each server we first calculate
a symbolic formula that represents how the server handled
the original HTTP request shown in Figure 3. We call
these formulas f_A, f_S, and f_M for Apache, Savant, and
Miniweb respectively. Then, for each of the three possible
server pairs (Apache-Miniweb, Apache-Savant, and
Savant-Miniweb), we calculate the combined formula as
explained in Section 3. For example, for the
Apache-Miniweb pair, the combined formula is
(f_A ∧ ¬f_M) ∨ (f_M ∧ ¬f_A). To obtain more detailed
information, we break the combined formula into two
separate queries to the solver, one representing each
side of the disjunction. For example, for the
Apache-Miniweb pair, we query the solver twice: once for
(f_A ∧ ¬f_M) and once for (f_M ∧ ¬f_A). The combined
formula is the disjunction of the two responses from the
solver.

[Figure 3 body: hex dump of the original request, an HTTP
GET for /index.html; not legible in the source.]

Figure 3: One of the original HTTP requests we used to
generate execution traces from all HTTP servers, during
the formula extraction phase.

[Table 4 body: table layout not recoverable from the
source; legible cells listed.]

    Case 1: unsatisfiable
    Case 2: 5/0
    Case 3: [value illegible]
    Case 4: 5/5
    Case 5: unsatisfiable
    Case 6: unsatisfiable

Table 4: Summary of deviations found for the HTTP servers,
including the number of candidate input queries requested
to the solver and the number of deviations found. Each
cell represents the results from one query to the solver
and each query to the solver handles half of the combined
formula for each server pair. For example, Case 3 shows
the results when querying the solver for (f_M ∧ ¬f_A),
and the combined formula for the Apache-Miniweb pair is
the disjunction of Cases 1 and 3.

Table 4 summarizes our results when sending the 
HTTP GET request in Figure 3 to the three servers. Each 
cell of the table represents a different query to the solver, 
that is, half of the combined formula for each server 
pair. Thus, the table has six possible cells. For example,
the combined formula for the Apache-Miniweb pair is shown
as the disjunction of Cases 1 and 3.

Out of the six possible cases, the solver returned un- 
satisfiable for three of them (Cases 1, 5, and 6). For the 
remaining cases, where the solver was able to generate at 
least one candidate deviation input, we show two num- 
bers in the format X/Y. The X value represents the num- 
ber of different candidate deviation inputs we obtained 
from the solver, and the Y value represents the number 
of these candidate deviation inputs that actually gener- 
ated semantically different output states when sent to the 
servers in the validation phase. Thus, the Y value repre- 
sents the number of inputs that triggered a deviation. 

In Case 2, none of the five candidate deviation inputs
returned by the solver were able to generate semantically
different output states when sent to the servers, that is,
no deviations were found. For Cases 3 and 4, all candidate
deviation inputs triggered a deviation when sent to the
servers during the validation phase. In both cases, the
Miniweb server accepted some input that was rejected by
the other server. We analyze these cases in more detail
next.


Applications to error detection and fingerprint gener- 
ation. Figure 4 shows one of the deviations found for 
the Apache-Miniweb pair. It presents one of the candi- 
date deviation inputs obtained from the solver in Case 3, 
and the responses received from both Apache and Mini- 
web when that candidate input was sent to them dur- 
ing the validation phase. The key difference is on the 
fifth byte of the candidate deviation input, whose original 
ASCII value represented a slash, indicating an absolute 
path. In the generated candidate deviation input, the byte 
has value 0xE8. We have confirmed that Miniweb does
indeed accept any value on this byte. So, this deviation 
reflects an error by Miniweb: it ignores the first character 
of the requested URI and assumes it to be a slash, which 
is a deviation from the URI specification [16]. 

Figure 5 shows one of the deviations found for the
Savant-Miniweb pair. It presents one of the candidate
deviation inputs obtained from the solver in Case 4,
including the responses received from both Savant and
Miniweb when the candidate deviation input was sent to
them during the validation phase. Again, the candidate
deviation input has a different value on the fifth byte,
but in this case the response from Savant is only a raw
“File not found” string. Note that this string does not
include the HTTP Status-Line, the first line in the
response that includes the response code, as required by
the HTTP specification, and can therefore be considered
malformed [27]. Thus, this deviation identifies an error,
though in this case both servers (i.e., Miniweb and
Savant) are deviating from the HTTP specification.

[Figure 4 body, partially legible: hex dumps of the
candidate deviation input (a request for /index.html with
a modified byte), the Miniweb response (headers include
“Server: Miniweb” and “Cache-control: no-cache”), and the
Apache response (fragments of “Bad Request” and
“Server: Apache/” are legible).]

Figure 4: Example deviation found for Case 3, where
Miniweb’s formula is satisfied while Apache’s isn’t. The
figure includes the candidate deviation input being sent
and the responses obtained from the servers, which show
two different output states.

[Figure 5 body, partially legible: hex dumps of the
candidate deviation input, the Miniweb response (headers
include “Server: Miniweb” and “Cache-control: no-cache”),
and the Savant response (“File not found”).]

Figure 5: Example deviation found for Case 4, where
Miniweb’s formula is satisfied while Savant’s isn’t. The
output states show that Miniweb accepts the input but
Savant rejects it with a malformed response.


Figure 6 shows another deviation found in Case 4 for 
the Savant-Miniweb pair. The HTTP specification man- 
dates that the first line of an HTTP request must include a 
protocol version string. There are 3 possible valid values 
for this version string: “HTTP/1.1”, “HTTP/1.0”, and 
“HTTP/0.9”, corresponding to different versions of the 
HTTP protocol. However, we see that the candidate
deviation input produced by the solver instead uses a
different version string, “HTTP/\b.1”. Since Miniweb
accepts this input, it indicates that Miniweb is not
properly verifying the values received on this field. On the
other hand, Savant is sending an error to the client indi- 
cating an invalid HTTP version, which indicates that it 
is properly checking the value it received in the version 
field. This deviation shows another error in Miniweb’s 
implementation. 
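To make the byte positions concrete, the following Python sketch rebuilds a request of the same shape and applies the version-string mutation. The Host value is made up, and the strict version check is only an assumed model of what a conforming server might do, not a description of Savant's code.

```python
# Request of the same shape as the original query; the Host value is
# a placeholder, not the one used in the experiments.
ORIGINAL = b"GET /index.html HTTP/1.1\r\nHost: example.test\r\n\r\n"

VALID_VERSIONS = {b"HTTP/0.9", b"HTTP/1.0", b"HTTP/1.1"}


def replace_byte(data: bytes, offset: int, value: int) -> bytes:
    """Return a copy of data with the byte at offset set to value."""
    return data[:offset] + bytes([value]) + data[offset + 1:]


def version_is_valid(request: bytes) -> bool:
    """Check the version string on the request line (assumed strict check)."""
    request_line = request.split(b"\r\n", 1)[0]
    parts = request_line.split(b" ")
    return len(parts) == 3 and parts[2] in VALID_VERSIONS


# Offset 21 (counting from zero) is the major version digit.
candidate = replace_byte(ORIGINAL, 21, 0x08)
```

With zero-based offsets, the slash of the URI sits at offset 4 (the fifth byte) and the mutated digit at offset 21, so the candidate carries the version string HTTP/\b.1 and fails the strict check.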


To summarize, in this section we have shown that our
approach is able to discover multiple inputs that trigger
deviations between real protocol implementations. We
have presented a detailed analysis of three of them, and
confirmed the deviations they trigger as errors. Of the
three inputs analyzed in detail, two can be attributed to
implementation errors in Miniweb, while the other one
reflects implementation errors in both Miniweb and
Savant. The discovered inputs that trigger deviations can
potentially be used as fingerprints to differentiate
among these implementations.


5.2 Deviations in Time Servers 


In this section we show our results for the NTP protocol 
using two different servers: NetTime [7] and Ntpd [13]. 
Again, for simplicity, we focus on a single request that 
we show in Figure 7. This request represents a simple 
query for time synchronization from a client. The request 
uses the Simple Network Time Protocol (SNTP) Version 
4 protocol, which is a subset of NTP [38]. 


Deviations detected. First, we generate the symbolic
formulas for both servers, f_T and f_N for NetTime and
Ntpd respectively, using the original request shown in
Figure 7. Since we have one server pair, we need to
query the solver twice. In Case 7, we query the solver for
(f_N ∧ ¬f_T) and in Case 8 we query it for (f_T ∧ ¬f_N).

[Figure 6 body, partially legible: hex dumps of the
candidate deviation input (a GET request for /index.html
with a modified version string), the Miniweb response
(headers include “Server: Miniweb” and
“Cache-control: no-cache”), and the Savant response
(headers include “Server: Savant/” and
“Content-Type: text/html”).]

Figure 6: Another example deviation for Case 4, between
Miniweb and Savant. The main difference is on byte 21,
which is part of the Version string. In this case Miniweb
accepts the request but Savant rejects it.

[Figure 7 body: hex dumps of the original request, the
candidate deviation input, and the NetTime response (Ntpd:
no response), with byte 0 broken down into its LI, VN, and
MD fields; not legible in the source.]

Figure 7: Example deviation obtained for the NTP servers.
It includes the original request sent in the formula
extraction phase, the candidate deviation input output by
the solver, and the responses received from the servers
when replaying the candidate deviation input. Note that
the output states are different since NetTime does send a
response, while Ntpd does not.


The solver returns unsatisfiable for Case 7. For Case 8, 
the solver returns several candidate deviation inputs. Fig- 
ure 7 presents one of the deviations found for Case 8. 
It presents the candidate deviation input returned by the 
solver, and the response obtained from both NTP servers 
when that candidate deviation input was sent to them dur- 
ing the validation phase. 


Applications to error detection and fingerprint gener- 
ation. The results in Figure 7 show that the candidate 
deviation input returned by the solver in Case 8 has dif- 
ferent values at bytes 0, 2 and 3. First, bytes 2 and 3 have 
been zeroed out in the candidate deviation input. This 
is not relevant since these bytes represent the “Poll” and 
“Precision” fields and are only significant in messages 
sent by servers, not in the queries sent by the clients, and 
thus can be ignored. 

The important difference is on byte 0, which is
presented in detail on the right hand side of Figure 7.
Byte 0 contains three fields: “Leap Indicator” (LI),
“Version” (VN), and “Mode” (MD). The difference with the
original request is in the Version field. The candidate
deviation input has a decimal value of 0 for this field
(note that the field length is 3 bits), instead of the
original decimal value of 4. When this candidate deviation
input was sent to both servers, Ntpd ignored it, choosing
not to respond, while NetTime responded with a version
number with value 0. Thus, this candidate deviation input
leads the two servers into semantically different output
states.
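The layout of byte 0 (2 bits of LI, 3 bits of VN, 3 bits of Mode, most significant bits first) can be made concrete with a short Python sketch. The example byte 0x23 (LI 0, version 4, client mode 3) is an assumed typical SNTP client value chosen for illustration; it is not taken from the figure.

```python
from typing import Tuple


def decode_ntp_byte0(b: int) -> Tuple[int, int, int]:
    """Split the first NTP header byte into (LI, VN, Mode)."""
    li = (b >> 6) & 0x3    # Leap Indicator: top 2 bits
    vn = (b >> 3) & 0x7    # Version: next 3 bits
    mode = b & 0x7         # Mode: low 3 bits
    return li, vn, mode


def with_version(b: int, vn: int) -> int:
    """Return byte 0 with the Version field replaced by vn."""
    return (b & ~(0x7 << 3) & 0xFF) | ((vn & 0x7) << 3)


original_byte0 = 0x23                      # assumed: LI=0, VN=4, Mode=3
mutated_byte0 = with_version(original_byte0, 0)
```

Zeroing the Version field leaves the other two fields untouched, so the mutated byte differs from the assumed original only in bits 3 to 5.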


We checked the specification for this case and found
that a zero value for the Version field is reserved and,
according to the latest specification, should no longer be
supported by current and future NTP/SNTP servers [38].
However, the previous specification states that the server
should copy the version number received from the client
in the request into the response, without dictating any
special handling for the zero value. Since both
implementations seem to be following different versions of
the specification, we cannot definitely assign this error
to one of the specifications. Instead, this example shows
that we can identify inconsistencies or ambiguity in
protocol specifications. In addition, we can use this
query as a fingerprint to differentiate between the two
implementations.

[Table 5 body: cell layout not recoverable from the
source. Columns include: Trace-to-IR time | % of Symbolic
Instructions | IR-to-formula time | Formula Size. Legible
values, row and column unknown: 9786, 25628, 2789, 0.073%,
1695, 5059.]

Table 5: Execution time and formula size obtained during
the formula extraction phase.

[Table 6 body: columns are the server pair and the Input
Calculation Time. Legible rows: Apache - Miniweb,
Apache - Savant, NetTime - Ntpd; the time values are not
legible in the source.]

Table 6: Execution time needed to calculate a candidate
deviation input for each server pair.


5.3 Performance


In this section, we measure the execution time and the 
output size at different steps in our approach. The re- 
sults from the formula extraction phase and the deviation 
detection phase are shown in Table 5 and Table 6, respec- 
tively. In Table 5, the column “Trace-to-IR time” shows 
the time spent in converting an execution trace into our 
IR program. The values show that the time spent to con- 
vert the execution trace is significantly larger for the web 
servers, when compared to the time spent on the NTP 
servers. This is likely due to a larger complexity of the 
HTTP protocol, specifically a larger number of condi- 
tions affecting the input. This is shown in the second 
column as the percentage of all instructions that operate 
on symbolic data, i.e., on data derived from the input. 
The “IR-to-formula time” column shows the time spent 
in generating a symbolic formula from the IR program. 
Finally, the “Formula Size” column shows the size of 
the generated symbolic formulas, measured by the num- 
ber of nodes that they contain. The formula size shows 
again the larger complexity in the HTTP implementa- 
tions, when compared to the NTP implementations. 

In Table 6, we show the time used by the solver in the
deviation detection phase to produce a candidate
deviation input from the combined symbolic formula. The
results show that our approach is very efficient in
discovering deviations. In many cases, we can discover
deviation inputs between two implementations in
approximately one minute. Fuzz testing approaches are
likely to take much longer, since they usually need to
test many more examples.


6 Discussion and Future Work 


Our current implementation is only a first step. In this 
section we discuss some natural extensions that we plan 
to pursue in the future. 


Addressing other protocol interactions. Currently, 
we have evaluated our approach over protocols that use 
request/response interactions (e.g. HTTP, NTP), where 
we examine the request being received by a server pro- 
gram. Note that our approach could be used in other 
scenarios as well. For example, with client programs,
we could analyze the response being received by the 
client. In protocol interactions involving multiple steps, 
we could consider the protocol output state to be the state 
of the program after the last step is finished. 


Covering rarely used paths. Some errors are hidden 
in rarely used program paths and finding them can take 
multiple iterations in our approach. For each iteration, 
we need a protocol input that drives both implementa- 
tions to semantically equivalent output states. These pro- 
tocol inputs are usually obtained from a network trace. 
Thus, the more different inputs contained in the trace 
the more paths we can potentially cover. In addition, 
we can query the solver for multiple candidate deviation 
inputs, each time requiring the new candidate input to 
be different than the previous ones. The obtained candi- 
date inputs often result in different paths. We have done 
work on symbolic execution techniques to explore mul- 
tiple program paths and plan to apply those techniques 
here [2, 17]. 


Creating formulas including multiple paths. In this
paper, we apply the weakest precondition on IR programs
that contain a single program path, i.e., the processing
of the original input by one implementation. However, our
weakest precondition algorithm is capable of handling IR
programs containing multiple paths [19]. In the future, we
plan to explore how to create formulas that include
multiple paths.


On-line formula generation. Our current implemen- 
tation for generating the symbolic formula works offline. 
We first record an execution trace for each implementa- 
tion while it processes an input. Then, we process the 
execution trace by converting it into the IR representa- 
tion, and computing the symbolic formula. Another al- 
ternative would be to generate the symbolic formulas in 
an on-line manner as the program performs operations 
on the received input, as in BitScope [2, 17]. 


7 Related Work 


Symbolic execution & weakest precondition. Sym- 
bolic execution was first proposed by King [34], and 
has been used for a wide variety of problems includ- 
ing generating vulnerability signatures [18], automatic 
test case generation [32], proving the viability of evasion 
techniques [35], and finding bugs in programs [21,47]. 
Weakest precondition was originally proposed for devel- 
oping correct programs from the ground up [24, 26]. It 
has been used for different applications including finding 
bugs in programs [28] and for sound replay of application 
dialog [42]. 


Static source code analysis. Chen et al. [23] manually
identify rules representing ordered sequences of
security-relevant operations, and use model checking
techniques
to detect violations of those rules in software. Udrea et 
al. [45] use static source code analysis to check if a C im- 
plementation of a protocol matches a manually specified 
rule-based specification of its behavior. 

Although these techniques are useful, our approach is 
quite different. Instead of comparing an implementation 
to a manually defined model, we compare implementa- 
tions against each other. Another significant difference 
is that our approach works directly on binaries, and does 
not require access to the source code. 


Protocol error detection. There has been considerable 
research on testing network protocol implementations, 
with heavy emphasis on automatically detecting errors 
in network protocols using fuzz testing [3-6, 9, 12, 14,
33, 37, 43, 46]. Fuzz testing is a technique in which
random or semi-random inputs are generated and fed to the
program under study, while monitoring for unexpected 
program output, usually an unexpected final state such 
as program crash or reboot. 


Compared to fuzz testing, our approach is more effi- 
cient for discovering deviations since it requires testing 
far fewer inputs. It can detect deviations by comparing 
how two implementations process the same input, even 
if this input leads both implementations to semantically
equivalent states. In contrast, fuzz testing techniques 
need observable differences between implementations to 
detect a deviation. 


There is a line of research using model checking 
to find errors in protocol implementations. Musuvathi 
et al. [40, 41] use a model checker that operates directly
on C and C++ code to check for errors in
TCP/IP and AODV implementations. Chaki et al. [22]
build models from implementations and check them against
a specification model. Compared to our approach, these
approaches need reference models to detect errors. 


Protocol fingerprinting. There has also been previous 
research on protocol fingerprinting [25, 44], but available
fingerprinting tools [8, 11, 15] use manually extracted
fingerprints. More recently, automatic fingerprint generation
techniques, working only on network input and output,
have been proposed [20]. Our approach is different
in that we use binary analysis to generate the candidate 
inputs. 


8 Conclusion 


In this paper, we have presented a novel approach to au- 
tomatically detect deviations in the way different imple- 
mentations of the same specification check and process 
their input. Our approach has several advantages: (1) by 
automatically building the symbolic formulas from the 
implementation, our approach is faithful to the
implementation; (2) automatically identifying the deviation
by solving formulas generated from the two implementations
enables us to find the needle in the haystack
without having to try each straw (input) individually, yielding
a tremendous performance gain; (3) our approach works
on binaries directly, i.e., without access to source code.
We also showed how to apply our automatic deviation detection
techniques to automatic error detection and automatic
fingerprint generation.


We have presented our prototype system to evaluate 
our techniques, and have used it to automatically dis- 
cover deviations in multiple implementations of two dif- 
ferent protocols: HTTP and NTP. Our results show that 
our approach successfully finds deviations between dif- 
ferent implementations, including errors in input check- 
ing, and differences in the interpretation of the specifica- 
tion, which can be used as fingerprints. 
Abstract 


In this paper we describe bugs and ways to attack 
trusted computing systems based on a Static root of trust 
such as Microsoft’s Bitlocker. We propose to use the dy- 
namic root of trust feature of newer x86 processors as 
this shortens the trust chain, can minimize the Trusted 
Computing Base of applications and is less vulnerable 
to TPM and BIOS attacks. To support our claim we 
implemented the Open Secure LOader (OSLO), the first 
publicly available bootloader based on AMD's skinit
instruction. 


1 Introduction 


An increasing number of Computing Platforms with 
a Trusted Platform Module (TPM) [33] are deployed. 
Applications using these chips are not widely used yet 
[5,37]. This will change rapidly with the distribution 
of Microsoft’s Bitlocker [2], a disk encryption utility 
which is part of Windows Vista Ultimate. As the trusted 
computing technology behind these applications is quite 
new, there is not much experience concerning the secu- 
rity of trusted computing systems. In this context we 
analyzed the security of TPMs, BIOSes and bootloaders 
that constitute the basic building blocks of trusted computing
implementations. Furthermore, we propose a design
that can improve the security of such implementations.


1.1 Trusted Computing 


Trusted Computing [9, 23, 25, 33] is a technology that 
tries to answer two questions:


• Which software is running on a remote computer?
(Remote Attestation)


• How to ensure that only a particular software stack
can access a stored secret? (Sealed Memory)


Different scenarios can be built on top of trusted com- 
puting, for example, multi-factor authentication [37], 
hard disk encryption [2,5] or the widely disputed Digital 
Rights Management. All of these applications are based 
on a small chip: the Trusted Platform Module (TPM). 


1.2 Technical Background 


As defined by the Trusted Computing Group (TCG), 
a TPM is a smartcard-like, low-performance cryptographic
coprocessor. It is soldered¹ onto various motherboards.
In addition to cryptographic operations such
as signing and hashing, a TPM can store hashes of the 
boot sequence in a set of Platform Configuration Regis- 
ters (PCRs). 


A PCR is a 160-bit-wide register that can hold a
SHA-1 hash. It cannot be written directly. Instead, it
can only be modified using the extend(x) operation.
This operation calculates the new value of a PCR as the
SHA-1 hash of the concatenation of the old value and x.
The extend operation is used to store a hash of a chain
of loaded software in PCRs. The chain starts with the
BIOS and includes Option ROMs², bootloader, OS and
applications.
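
The extend operation can be illustrated with a minimal sketch. It assumes that x is itself the SHA-1 digest of the loaded component and that the PCR starts at zero; the component names and the helper measure_chain are hypothetical and serve only as an illustration, not as a TPM interface.

```python
import hashlib

PCR_SIZE = 20  # 160 bits, the size of a SHA-1 digest


def extend(pcr: bytes, x: bytes) -> bytes:
    """Return the new PCR value: SHA-1(old value || x)."""
    assert len(pcr) == PCR_SIZE
    return hashlib.sha1(pcr + x).digest()


def measure_chain(components):
    """Extend a zeroed PCR with the SHA-1 hash of each component, in order."""
    pcr = bytes(PCR_SIZE)
    for blob in components:
        pcr = extend(pcr, hashlib.sha1(blob).digest())
    return pcr


# Hypothetical stand-ins for the real boot components.
boot_chain = [b"BIOS", b"Option ROMs", b"Bootloader", b"OS"]
good = measure_chain(boot_chain)
tampered = measure_chain([b"BIOS", b"Option ROMs", b"evil loader", b"OS"])
reordered = measure_chain([b"Option ROMs", b"BIOS", b"Bootloader", b"OS"])

print(good.hex())
print(good != tampered, good != reordered)
```

Because every new value depends on the old one, the final PCR value changes if any component is replaced or if the order of the extends changes.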


Using a challenge-response protocol, this trust chain 
can attest to a remote entity which software is running 
on the platform (remote attestation). Similarly it can be 
used to seal some data to a particular, not necessarily the 
currently running, software configuration. Unsealing the 
data is then only possible when this configuration was 
started. Figure 1 shows such a trust chain based on a 
Static Root of Trust for Measurement (SRTM), namely 
the BIOS. 
TPM ==> BIOS ==> OptionROMs ==> BootLoader ==> OS ==> Application


Figure 1. Typical trust chain in a TC system


1.3 Chain of Hashes


Three conditions must be met to make a chain of
hashes trustworthy:


1. The first code running and extending PCRs after 
a platform reset (called SRTM) is trustworthy and 
cannot be replaced. 


2. The PCRs are not resettable without passing control
to trusted code.


3. The chain is contiguous. There is no code in- 
between that is executed but not hashed. 


The reasons behind these conditions are the follow- 
ing: If the initial code is not trustworthy or can be re- 
placed by untrustworthy code, it cannot be guaranteed 
that any hash value is correct. This code can in fact mod- 
ify any later running software to prevent the undesirable 
hashing. The second condition is quite similar and can 
be seen as a generalization of the first one. If PCRs are 
reset and untrustworthy code is running then any chain 
of hashes can be fabricated. The first two points de- 
scribe the beginning of the trust chain. The third point 
is needed to form a contiguous chain by recursion. It
requires that every program occupying the
machine be hashed before it is executed. Otherwise,
the trust chain is interrupted and unmeasured code
can be running. Every program using sealed memory
has to trust the code running before it to not open a hole 
in the chain. Similarly, a remote entity needs to find out 
during an attestation whether the trust chain presented 
by a trusted computing platform contains any hole in 
which untrusted code could be run. We will see later 
how current implementations do not meet the three con- 
ditions. 


Organization 


This paper is structured as follows. We describe bugs 
and ways to attack trusted computing systems based on 
SRTM in the next section. After that we present the de- 
sign and describe the implementation of OSLO. A sec- 
tion evaluating the security achievements follows. The 
last section proposes future work and concludes. 


2 Security Analysis


2.1 Bootloader Bugs


We look at the three publicly available TPM-enabled 
bootloaders and analyze whether they violate the third 
condition of a trust chain, executing code that is not 
hashed. 

The very first publicly available trusted bootloader 
was part of the Bear project from Dartmouth College 
[19,20]. They enhanced Linux with a security module 
called Enforcer. This module checks for modification 
of files and uses the TPM to seal a secret key of an en- 
crypted filesystem. To boot the system they used a mod- 
ified version of LILO [7]. They extend LILO in two 
ways: the Master Boot Record hashes the rest of LILO 
and the loaded Linux kernel image is also hashed. However, only
the last part of the image, containing the kernel itself, is
hashed. The first part of the image, containing
the real-mode setup code, is executed without being hashed.
Hence, this violates the third condition. A fix for this bug would be to
hash every sector that gets loaded.

A second trusted bootloader is a patched GRUB 
v0.97 from IBM Japan [21,36]. This bootloader is used 
in IBM's Integrity Measurement Architecture [28]. It has
the same security flaw as our own experiments with a 
TCG enabled GRUB [16]: it loads files twice, first for 
extraction and later for hashing into a PCR. One cause of
this bug certainly lies in the structure of GRUB. GRUB
loads and extracts a kernel image at the same time instead
of loading it completely into memory and extracting
it afterwards. As a result, measuring
the file independently from loading is
the easiest way for a programmer to add TCG support
to GRUB. Such an implementation is unfortunately incorrect.
As program code is loaded twice from disk or
from a remote host over the network, an attacker who
has physical access either to the disk or to the network
can send different data the second time. This again
violates the third condition, as hashed and executed code
may differ.

Another GRUB based trusted bootloader called 
TrustedGRUB [35] solves this issue in a recent version 
by moving the hash code to a lower level. Hashing is 
simply done on each read() call that loads data from
disk or network, before the actual data is returned to the 
caller. The hash is then used after loading a kernel to 
extend a PCR. 
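
The difference between measuring a file in a separate pass and hashing on each read can be sketched as follows. This is a toy model, not code from GRUB or TrustedGRUB: FlakySource, load_twice and load_once are hypothetical names, and the attacker is assumed to control what the disk or network returns on each read.

```python
import hashlib


class FlakySource:
    """Hypothetical attacker-controlled disk or network source that
    returns different data on the second read of the same file."""

    def __init__(self, first: bytes, later: bytes):
        self._reads = [first, later]

    def read(self) -> bytes:
        # Hand out the first payload once, then the second one forever.
        return self._reads.pop(0) if len(self._reads) > 1 else self._reads[0]


def load_twice(source):
    """Flawed pattern: load for execution, then load again for measuring."""
    executed = source.read()
    measured = hashlib.sha1(source.read()).digest()
    return executed, measured


def load_once(source):
    """Hash-on-read pattern: the bytes handed to the caller are the bytes hashed."""
    data = source.read()
    return data, hashlib.sha1(data).digest()


code, digest = load_twice(FlakySource(b"evil kernel", b"good kernel"))
print(digest == hashlib.sha1(code).digest())   # measurement does not match executed code

code, digest = load_once(FlakySource(b"evil kernel", b"good kernel"))
print(digest == hashlib.sha1(code).digest())   # measurement matches executed code
```

In the first pattern the PCR would be extended with the hash of the benign image while the modified image is executed; in the second the two cannot diverge.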

The current version 1.0-rc5 of TrustedGRUB (August 
2006) contains at least two other bugs. The hashing of 
its own code when starting from hard disk is broken. The 
corresponding PCR is never extended and always zero. 
Furthermore TrustedGRUB never contained any code to 
use it securely from a CD. Nevertheless, it is used on a 
couple of LiveCDs [6]. 

All publicly available TPM-enabled bootloaders violate
the third condition, which makes systems booted
by them unable to prove their trustworthiness. To establish
this, it was not necessary to look at more sophisticated
attack points such as missing range checks or
buffer overflows. Both of these will become more in- 
teresting if the aforementioned bugs are fixed. 


2.2 TPM Reset 


In July 2004 we discovered that setting the reset bit 
in a control register of a v1.1 TPM³ resets the chip
without resetting the whole platform. This violates the 
second condition. As it results in default PCR values, 
this breaks the remote attestation and sealing features of 
those chips: Any PCR value can be reproduced without 
the opportunity for a remote entity to see the difference 
via remote attestation. Unsealing protected secrets of 
a security critical program is possible after resetting as 
well. The reset feature was added for maintenance rea- 
sons but does not have broad security consequences, be- 
cause sealing and remote attestation are not used in any 
product application with v1.1 chips. Instead, the chips
are solely used as smartcards for signing and key
management.

This case demonstrates the security risk of a reset- 
table TPM. As other chips have different interfaces and 
can therefore not be reset in the same way, we exper- 
imented with a simple hardware attack. The Low Pin 
Count (LPC) bus was the point of attack. Most TPMs 
are connected to the southbridge through it and the bus 
has a separate reset line. We used different TPMs on 
external daughterboards for this experiment. 

By physically connecting the LRESET# pin to 
ground we were able to perform a reset of the chip 
itself. We separated the pin from the bus as other- 
wise the PS/2 keyboard controller received such a re- 
set signal, too. We had to reinitialize the chip which 
we did by reloading the driver and then sending a 
TPMStartup(TPM_CLEAR) to the chip. This process
gave us an activated and enabled TPM in a state
normally only visible to the BIOS: As expected all PCRs 
were in their default state. We presume that this attack 
could be mounted against any TPM in a similar way. 

The simplicity of the reset makes this hardware attack
a threat to trusted computing systems, in particular
in use cases where physical access, for example through
theft, cannot be excluded. This attack also affects another
use case of trusted computing, the widely disputed
Digital Rights Management scenario, where the owner
of a device is untrusted and can use the system in
unintended ways.

We have to admit that the TCG does not claim to 
protect against hardware attacks. But scenarios using 
trusted computing technology have to be aware of these 
restrictions. 


2.3 BIOS Attack


We have shown that bootloader and TPM implemen- 
tations have some weaknesses. Now we look at the en- 
tity in-between them: the BIOS. 

The BIOS contains the Core Root of Trust for Mea- 
surement (CRTM), a piece of code that extends PCR 0 
initially. A CRTM must only be replaceable by vendor-signed
code. Currently, the CRTM of many machines is
freely patchable. It is stored in flash and no signature 
checking is performed on updates. This violates the first 
condition needed by a trust chain. 

We used an HP nx6325, a recent business notebook
with a TPM v1.2, for this experiment. The fact that the 
BIOS is flashed from a raw image eased an attack. Other 
vendors are checking a hash before flashing the image to 
avoid transmission errors, a feature that is missing here. 
Checking a hash is irrelevant from a security point of 
view but it would make the following steps slightly more 
complicated, as we would have to recalculate the correct 
hash value. 

The part of the BIOS we chose to patch is the TPM
driver. This has the advantage that all commands to the 
TPM, whether they come from the CRTM or from a 
bootloader through the INT 1Ah interface, can be in- 
tercepted. Our BIOS has only a memory-present TPM 
driver. These drivers need access to main memory for 
execution and can therefore only run after the BIOS has 
initialized the RAM. The interface of the TPM drivers
is defined in the TCG PC client specification for conventional
BIOS [34]. The function that we want to
disable is MPTPMTransmit(), which transmits commands
to the TPM. We found the TPM driver in the
BIOS binary quite easily. Strings like 'TPM' and the
magic number of the code block, as well as characteristic
mnemonics (e.g., in and out) in the disassembly,
point to it.

Figure 2 shows the start of the BIOS TPM driver. It
starts with a magic number and entry point, both as
defined in the specification. The code itself starts at
address 0x28. We now search for an instruction that
allows us to disable MPTPMTransmit(). The first
instructions of the driver are quite uninteresting. They
just save some registers to the stack and calculate the
driver's starting address in register edi in order to make
the code position independent. The first interesting
instruction is the comparison at address 0x3a. By looking
further into the disassembly we found out that this
instruction is part of the branch where the code
distinguishes between MPTPMTransmit() (where al=4)
and other functions. By changing this comparison to
cmp $0x14,%al, which requires flipping just a single
bit, we can prevent the branch at 0x3c from being taken, so
that no command is transmitted to the TPM. An error code
is returned to the caller instead.


 0: aa 55               /* magic number */
 4: 28 00               /* entry point */
28: 57                  push %edi
29: 56                  push %esi
2a: 53                  push %ebx
2b: 33 ff               xor  %edi,%edi
2d: e8 00 00 00 00      call 0x32
32: 5f                  pop  %edi
33: 81 ef 33 00 00 00   sub  $0x33,%edi
39: 47                  inc  %edi
3a: 3c 04               cmp  $0x4,%al
3c: 74 23               je   0x61
3e: 3c 01               cmp  $0x1,%al
40: 74 0a               je   0x4c


Figure 2. Start of BIOS TPM driver
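
The single-bit patch can be checked with a few lines of code. The sketch below assumes the instruction encoding 3c 04 for cmp $0x4,%al, with the immediate byte at offset 0x3b of the driver; the helper names and the zero-padded image are illustrative only, and only the instruction bytes from 0x3a onwards are modelled.

```python
ORIGINAL = bytes([0x3C, 0x04])  # cmp $0x4,%al
PATCHED = bytes([0x3C, 0x14])   # cmp $0x14,%al


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def flip_bits(image: bytearray, offset: int, mask: int) -> None:
    """Flip the bits selected by mask in the byte at the given offset."""
    image[offset] ^= mask


# Instruction bytes at offsets 0x3a..0x41 of the driver; everything before
# 0x3a is zero padding in this illustration.
FRAGMENT = bytes.fromhex("3c0474233c01740a")
image = bytearray(0x3A) + bytearray(FRAGMENT)

flip_bits(image, 0x3B, 0x10)  # 0x04 ^ 0x10 == 0x14

print(differing_bits(ORIGINAL, PATCHED))
print(image[0x3A:0x3C].hex())
```

After the patch the comparison can no longer match al=4, so the conditional jump that follows it is never taken for MPTPMTransmit().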

We now have to flash the BIOS with this modified im- 
age. As there is no hash of the BIOS image checked dur- 
ing flashing we use the normal BIOS update procedure. 
After a reboot we have a TPM in its default power-on 
state, without any PCR extensions. 

The ability to easily exchange the CRTM violates the 
TCG specifications. A result of this bug is that trust
in these machines cannot be restored
without an expensive certification process.


2.4 Summary 


We found weaknesses in bootloaders and the possi- 
bility of a simple hardware attack against TPMs. Fur- 
thermore by just flipping a single bit we disabled the 
CRTM and any PCR extension from the BIOS. These 
cases show that current implementations do not meet all 
three conditions of a trust chain. 

In summary, we conclude that current BIOSes and
bootloaders are not able to start systems in a trustworthy
manner. Moreover, TPMs are not protected against
resets.


TPM ==> OSLO ==> OS ==> Application


Figure 3. Trust chain with a Dynamic Root 
of Trust for Measurement (DRTM) 


3 Design and Implementation of OSLO


3.1 Using a DRTM


The main idea behind a secure system with a resettable
TPM, an untrusted BIOS and a buggy bootloader
is to use a Dynamic Root of Trust for Measurement
(DRTM). A DRTM effectively removes the BIOS,
OptionROMs and bootloaders from the trust chain (cf.
Figure 3).

With a DRTM, the CPU can reset PCR 17 at any
time. This is provided through a new instruction that
atomically initializes the CPU, loads a piece of code
called the Secure Loader (SL) into its cache, sends the code
to the TPM to extend the reset PCR 17, and transfers
control to the SL.

A design based on a DRTM is not vulnerable to the 
TPM reset attack because of a TPM property that can be 
easily missed. A TPM can distinguish between a reset 
and a DRTM due to CPU and chipset support. A reset 
of the TPM sets all PCRs to default values, which are
“0” for PCRs 0 to 16 and “-1” for PCR 17. Only a
DRTM, with its special bus cycles, will reset PCR
17 to “0” and immediately extend it with the hash of
the SL. Therefore, an attacker is unable to reset PCR 17
to “0” and fake other platform configurations. Only by
executing the skinit instruction is it possible to put
the hash of an SL into PCR 17. An attacker cannot hash
an SL and directly afterwards execute code outside of
it, since skinit jumps directly to the SL.
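
Why a plain reset does not help an attacker can be seen in a toy model of PCR 17. The class ToyTPM and its methods are hypothetical and do not correspond to any real TPM interface; the model only encodes the default values stated above and assumes that “-1” means all bits set.

```python
import hashlib

ZERO = bytes(20)          # the 160-bit value "0"
MINUS_ONE = b"\xff" * 20  # the 160-bit value "-1" (assumed: all bits set)


def extend(pcr: bytes, digest: bytes) -> bytes:
    """New PCR value: SHA-1(old value || digest)."""
    return hashlib.sha1(pcr + digest).digest()


class ToyTPM:
    """Toy model of the PCR 17 behaviour described in the text."""

    def __init__(self):
        self.reset()

    def reset(self):
        # A plain TPM reset puts PCR 17 into its default state "-1".
        self.pcr17 = MINUS_ONE

    def extend17(self, digest: bytes):
        # Ordinary software can only extend the register, never write it.
        self.pcr17 = extend(self.pcr17, digest)

    def drtm_launch(self, secure_loader: bytes):
        # Only the DRTM bus cycles reset PCR 17 to "0" and then extend it
        # with the hash of the Secure Loader.
        self.pcr17 = extend(ZERO, hashlib.sha1(secure_loader).digest())


sl = b"secure loader image"  # hypothetical SL contents

honest = ToyTPM()
honest.drtm_launch(sl)

attacker = ToyTPM()
attacker.reset()
attacker.extend17(hashlib.sha1(sl).digest())

print(honest.pcr17 != attacker.pcr17)
```

Starting from “-1”, the attacker would have to find a sequence of extends that ends in the value produced from “0”, which amounts to attacking SHA-1 itself.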

An SL is also not affected by the BIOS attack. With 
the presence of a DRTM, the BIOS need not be trusted 
anymore to protect its CRTM and hash itself into the 
TPM. Nevertheless, a statement that claims the BIOS 
can be fully untrusted is oversimplified: We still have 
to trust the BIOS for providing the System Management 
Mode (SMM) code as well as correct ACPI tables. As 
both can be security critical, a hash of them should be 
incorporated at boot time into a PCR by the operating 
system. 


3.2 Implementation


AMD provides a DRTM with its skinit instruction,
which was introduced with the AMD-V extension
[1]. On Intel CPUs, the Trusted Execution Technology
(TET) includes a DRTM with the senter instruction
[9, 14]. AMD was generous enough to provide us with an
AMD-V platform nearly one year before we were
able to buy an Intel TET platform.

Our implementation, called OSLO (Open Secure 
LOader), is written in C with some small parts in as- 
sembler. As OSLO is part of the Trusted Computing 
Base (TCB) of all applications, we wanted to minimize 
the binary and source code size. Furthermore, we had 
to avoid any BIOS call, as otherwise the BIOS would be 
part of the TCB again. 

OSLO is started as a kernel from a multiboot-compliant
[22] loader. It initializes the TPM to be able to extend
a PCR with the hashes of further modules. After
that other processors are stopped. This is required be- 
fore executing skinit and inhibits potential interfer- 
ences during the secure startup procedure. For example, 
malicious code running on a second CPU could modify 
the instructions of the Secure Loader. The cache consis- 
tency protocol would then propagate the changes to the 
other processor. 

Once the required platform initialization is done,
OSLO can switch to the “secure mode” by executing
skinit. Before starting the first module as a new
kernel, OSLO hashes every module that was preloaded
by the parent bootloader.

We used chainloading via the multiboot specification 
to be flexible with respect to the operating system OSLO 
loads and who can load OSLO. Normally, this will be 
a multiboot-compliant loader started by the BIOS such 
as GRUB or SysLinux [31] but loading OSLO from the 
Linux kexec environment [17] should also be possible. 

As we could not rely on the BIOS for talking to the 
TPM, we also implemented our own TPM driver for 
v1.2 TPMs. As all of these TPMs should follow the 
TPM interface specification (TIS) only a single driver 
was needed. Using this memory mapped interface is, 
compared to the different interfaces needed to talk to the 
v1.1 TPMs, rather simple. Therefore our TPM driver 
consists of only 70 lines of code. 

Currently two features of OSLO are still unimple- 
mented: 


• protection against direct memory access (DMA)
from malicious devices, and


• extension of the TPM event log for remote attestation.


The TPM event log is used to ease remote attesta- 
tion. It can store hashes used as input for extend and 
optionally a string describing them. The log provides a 
breakdown of the PCR value into smaller known pieces. 
It is itself not security critical and therefore not protected
by the bootloader or the operating system. An attacker
can only perform Denial of Service attacks by, for
example, overwriting the log. It is not possible to compromise
the security of a remote attestation by modifying
the log. The TPM event log makes it much easier for
a remote entity to check reported hash values against
a list of known-good values, for example if the order
of the extends is not fixed. OSLO should extend the
event log to support applications relying on it for remote 
attestation. 
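
How a remote entity might use such a log can be sketched as follows. The functions replay and verify are hypothetical and not part of OSLO; the sketch assumes that the reported PCR value has already been authenticated by the TPM and that the PCR started at zero.

```python
import hashlib


def extend(pcr: bytes, digest: bytes) -> bytes:
    """New PCR value: SHA-1(old value || digest)."""
    return hashlib.sha1(pcr + digest).digest()


def replay(log, start=bytes(20)):
    """Recompute a PCR value from (digest, description) log entries."""
    pcr = start
    for digest, _description in log:
        pcr = extend(pcr, digest)
    return pcr


def verify(log, reported_pcr, known_good):
    """Accept only if the log reproduces the reported PCR value and
    every logged digest is on the list of known-good values."""
    if replay(log) != reported_pcr:
        return False  # log was modified or does not belong to this PCR
    return all(digest in known_good for digest, _ in log)


# Hypothetical measurements of two boot modules.
log = [
    (hashlib.sha1(b"kernel").digest(), "kernel"),
    (hashlib.sha1(b"initrd").digest(), "initrd"),
]
known_good = {digest for digest, _ in log}

print(verify(log, replay(log), known_good))
```

A modified log no longer reproduces the reported PCR value, so tampering with it can at most make verification fail, which matches the Denial of Service observation above.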

The source code of OSLO is available under the 
terms of the GPL [24]. The source includes three addi- 
tional tools that can be multi-boot loaded after OSLO: 
Beirut to hash command lines, Pamplona to revert
the steps done by skinit for booting OSLO-unaware
OSes, and Munich to start Linux from a multiboot
environment.


3.3 Lessons Learned


We have learned two lessons while implementing 
OSLO: 


• It is hard to write secure initialization code, and


• a secure loader needs to have platform-specific
knowledge.


An example of the first lesson is our experience with 
the initialization of the Device Exclusion Vector (DEV) 
on AMD CPUs. A DEV is a bit vector in physical memory
that consists of one bit per physical 4k page. A bit
in this vector decides whether device-based DMA transfers
to or from the corresponding page are allowed. DEVs
can be cached in the chipset for performance reasons.
We found out that the DEV initialization, if it is done in 
the naive way, contains a race condition. 

DEV initialization is normally done in two steps: setting
the appropriate bits in the vector so that it protects itself,
and then flushing the chipset-internal DEV cache. As
these two operations are not atomic, a malicious device 
could change the DEV using DMA just before the vec- 
tor is loaded into the DEV cache. An implementation 
has to find a workaround for this race. A secure way to 
initialize DEV protection is, for example, to use an in- 
termediate DEV in the 64k of the secure loader thereby 
protecting the initialization of a final DEV. 
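
The layout of such a bit vector can be sketched in a few lines. The bit order within a byte and the convention that a set bit denies DMA are assumptions made for this illustration and should be checked against the AMD documentation; the function names are hypothetical.

```python
PAGE_SHIFT = 12  # 4k pages


def dev_position(phys_addr: int):
    """Return (byte index, bit index) of the DEV bit covering phys_addr."""
    page = phys_addr >> PAGE_SHIFT
    return page // 8, page % 8


def protect_range(dev: bytearray, start: int, length: int) -> None:
    """Set the DEV bit of every page overlapping [start, start + length)."""
    first = start >> PAGE_SHIFT
    last = (start + length - 1) >> PAGE_SHIFT
    for page in range(first, last + 1):
        dev[page // 8] |= 1 << (page % 8)


def dev_size_bytes(phys_mem_bytes: int) -> int:
    """Size of a DEV that covers phys_mem_bytes of physical memory."""
    pages = (phys_mem_bytes + (1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT
    return (pages + 7) // 8


dev = bytearray(dev_size_bytes(4 << 30))  # DEV covering 4 GiB of memory
protect_range(dev, 0x100000, 0x10000)     # protect 64k starting at 1 MiB

print(len(dev), dev_position(0x100000))
```

Under these assumptions a DEV covering 4 GiB occupies 128 KiB, and protecting the 64k of a secure loader touches only two bytes of the vector.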

The second point is a little bit more complicated.
DEVs can only protect against DMA from a device. If
someone puts an operating system they want to start with
OSLO into device memory, it cannot be protected from
a malicious device. The OS is loaded and hashed by
OSLO as if it resided in RAM, but if it is read a
second time, e.g., on ELF decoding or execution, it is
requested from the device memory again. Because we do
not trust a device to leave its memory unmodified, we
cannot be sure that the code that is executed is identical
to the hashed one. As a consequence we can only protect,
hash and start modules that are located in RAM. A
secure loader therefore needs a reliable method to
distinguish between RAM and device memory.


Name   | size    OSLO sha1   sha1sum
kernel | 1.2 MB  0.070 sec   0.020 sec
initrd | 4.2 MB  0.245 sec   0.064 sec
sum    | 5.4 MB  0.315 sec   0.084 sec


Figure 4. Performance of hashing a Linux
kernel and Initrd


Name        | LOC    binary in KB  gzip in KB
BIOS HP     | -      1024          491
GRUB v0.97  | 19600  98            55
OSLO v0.4.2 | 1534   4.1           2.9


Figure 5. Size of BIOS, GRUB and OSLO


4 Evaluation 


One of our design goals for OSLO was a minimal 
TCB size. Reducing the TCB is suitable for security sen- 
sitive applications as it increases the understandability 
and minimizes the number of possible bugs [30]. Fur- 
thermore, the process of formal verification will bene- 
fit from it. We achieved a minimal TCB by using two 
techniques: reducing functionality and trading size with 
performance penalties. 

An example of the first is that we do not rely on
external libc code but use functions with limited functionality
like out_string() instead of a full-featured
printf() implementation.

We also implemented our own SHA-1 code trading 
size for performance. This resulted in an SHA-1 imple- 
mentation that compiles with gcc-3.4 to less than 512 
bytes. This is only a quarter of the size compared with a 
performance optimized version such as the one from the 
Linux kernel. This, on the other hand, makes the hash 
much slower. The Linux version has a throughput which 
is three to four times higher, due to, e.g., loop unrolling. 

Figure 4 shows that hashing a Linux kernel and initrd with our SHA-1
implementation takes 0.315 seconds compared to 0.084
seconds for a heavily optimized sha1sum version. As
booting a system usually takes minutes, a performance
penalty of 0.231 seconds is acceptable here.
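
The numbers in Figure 4 can be cross-checked with a short calculation. The sketch below only uses the sizes and timings from the figure; the variable names are arbitrary.

```python
# Sizes and timings taken from Figure 4.
sizes_mb = {"kernel": 1.2, "initrd": 4.2}
oslo_sec = {"kernel": 0.070, "initrd": 0.245}
sha1sum_sec = {"kernel": 0.020, "initrd": 0.064}

total_mb = sum(sizes_mb.values())
oslo_total = sum(oslo_sec.values())
sha1sum_total = sum(sha1sum_sec.values())

penalty = oslo_total - sha1sum_total    # extra time spent hashing
slowdown = oslo_total / sha1sum_total   # relative throughput difference
oslo_mb_per_sec = total_mb / oslo_total
sha1sum_mb_per_sec = total_mb / sha1sum_total

print(f"penalty: {penalty:.3f} s, slowdown: {slowdown:.2f}x")
print(f"throughput: {oslo_mb_per_sec:.1f} MB/s vs {sha1sum_mb_per_sec:.1f} MB/s")
```

This gives a throughput of roughly 17 MB/s for the size-optimized code against roughly 64 MB/s for sha1sum, a factor of 3.75, which is consistent with the "three to four times" stated above.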

Figure 5 shows the source and binary sizes for BIOS,
GRUB and OSLO. We also give the size of gzip-compressed
binaries in this table as this reduces the effect of
empty sections in the images. Unfortunately, the source 
code of the HP BIOS is not available. A similar but older 
Award BIOS consists of around 150 thousand lines of 
assembler code. The numbers given for GRUB do not 
include the drivers used to boot from a network. Adding 
them would nearly double the given numbers. 

OSLO is an order of magnitude smaller than GRUB 
and two orders of magnitude smaller than the BIOS we 
examined. If we presume the principle more code equals 
more bugs and neglect the effect of a code size optimiz- 
ing compiler, we can deduce that OSLO has a signifi- 
cantly smaller number of bugs due to its size compared 
to GRUB or the BIOS. 

One could argue that in an ordinary system like Win- 
dows or Linux, where the TCB of an application consists 
of millions of lines of code with programs consuming tens
or hundreds of megabytes, the size of GRUB and the 
BIOS does not matter. That is perhaps true, but as the 
trend in secure systems goes to small kernels and hy- 
pervisors [10, 13, 29, 32], architectures like L4/NIZZA 
or Xen can very well benefit from the TCB reduction 
through OSLO. 

In summary, OSLO promises a smaller attack surface
due to its minimal size and, since it uses a DRTM,
mitigates the TPM reset and the BIOS attacks as outlined
in Section 3.1.


5 Related Work 


Previous research showed the vulnerability of trusted 
computing platforms against hardware attacks. Kursawe 
et al. [18] eavesdrop on the LPC bus to capture and anal- 
yse the communication between the CPU and the TPM. 
They only perform a passive attack, but describe that an 
active hardware attack on the LPC bus could be used to 
fool the TPM about the platform state. Untrusted code 
can then pretend to the TPM to be a DRTM. 

Limitations of the trusted computing specification 
and its implementations are described in the literature 
multiple times. Bruschi et al. [4] showed that an au- 
thorization protocol of TPMs is vulnerable to replay at- 
tacks. Sadeghi et al. [27] reported that many TPM im- 
plementations do not meet the TCG specification. Gar- 
riss et al. [8] found out that a public computing kiosk that 
uses remote attestation to prove which software is run- 
ning is vulnerable to boot-between attestation attacks. 
They suggest a reboot counter in the TPM to make re- 
boots visible to remote parties. Such a counter will not 
help against our TPM reset attack as it needs to detect 
whether a TPM was switched on later than the whole 
platform⁴, a property a reboot counter cannot achieve.
There are more sophisticated BIOS attacks men- 
tioned in the literature. Heasman [12], for example, 
showed at the Blackhat Federal 2006 that a rootkit can 
be hidden in ACPI code which is usually stored in the 
BIOS. In a subsequent paper [11], he describes how a 
rootkit can persist in a system with a secured BIOS by 
using other flash chips. In both cases only TPM-less 
systems were considered. By combining our attack to 
disable the CRTM with Heasman’s work it seems possi- 
ble to hide a rootkit in the BIOS but report correct hash 
values to the TPM. 

To generally prevent BIOS attacks, Phoenix Tech- 
nologies offers a firmware called TrustedCore [26] that 
allows only signed updates. Intel Active Management 
Technology [15] also has this feature.

Sailer et al. [28] describe an architecture for an in- 
tegrity measurement system for Linux using a static root 
of trust. As they focus on the enhancements of the 
operating system, the architecture is not limited to an 
SRTM. Their implementation could easily benefit from
the smaller attack surface of a secure loader like OSLO. 


6 Future Work and Conclusion 


OSLO is not feature complete yet. We plan to finish 
the implementation of the DMA protection. Moreover, 
we want to add ACPI event-log support. This should 
allow the integration of OSLO into larger projects that 
use the event-log for remote attestation. 

A port of OSLO to use the senter instruction on an
Intel TET platform could demonstrate that the multiboot
chainloader design is portable or show that senter implies
an integrated design as it is proposed for Xen [38].

The search for new attack points of other trusted com- 
puting implementations is also part of our future work. 

It was not necessary to look at more sophisticated at- 
tack points such as buffer overflows or the strength of 
cryptographic algorithms to find the bugs and attacks we 
presented in this paper. If we compare this to a simi- 
lar analysis of another secure system, such as the one 
of an RFID chip [3], we have to conclude that current 
trusted computing implementations are not resilient to 
even simple attacks. Moreover, the current implemen- 
tations do not meet the assumptions of a secure design. 
Even a small bug in them can compromise the additional 
security obtained by a TPM. 

We suspect that most platforms are vulnerable
to the TPM reset and many of them to the BIOS attack.
As a consequence, software still based on an
SRTM, such as Microsoft’s BitLocker, cannot provide
secure TPM-driven encryption and attestation on these
systems.


A switch to a DRTM-based, OSLO-like approach can
shorten the trust chain, minimize the TCB, and reduce
vulnerability to TPM and BIOS attacks.


Notes


1 There also exist TPMs on daughterboards. Their security value is
limited as exchanging them is quite easy.

2 Firmware on adapter cards

3 It would be quite unfair to disclose the vendor name here.

4 E.g., by holding the reset line of a TPM while powering the machine
up


Abstract


We describe a “cheat” attack, allowing an ordinary pro- 
cess to hijack any desirable percentage of the CPU 
cycles without requiring superuser/administrator privi- 
leges. Moreover, the nature of the attack is such that, 
at least in some systems, listing the active processes will 
erroneously show the cheating process as not using any 
CPU resources: the “missing” cycles would either be at- 
tributed to some other process or not be reported at all (if 
the machine is otherwise idle). Thus, certain malicious 
operations generally believed to have required overcom- 
ing the hardships of obtaining root access and installing a 
rootkit, can actually be launched by non-privileged users 
in a straightforward manner, thereby making the job of a 
malicious adversary that much easier. We show that most 
major general-purpose operating systems are vulnerable 
to the cheat attack, due to a combination of how they ac- 
count for CPU usage and how they use this information 
to prioritize competing processes. Furthermore, recent 
scheduler changes attempting to better support interac- 
tive workloads increase the vulnerability to the attack, 
and naive steps taken by certain systems to reduce the 
danger are easily circumvented. We show that the attack 
can nevertheless be defeated, and we demonstrate this
by implementing a patch for Linux that eliminates the 
problem with negligible overhead. 


Prologue 


Some of the ideas underlying the cheat attack were im- 
plemented by Tsutomu Shimomura circa 1980 at Prince- 
ton, but it seems there is no published or detailed essay 
on the topic, nor any mention of it on the web [54]. Re- 
lated publications deal solely with the fact that general- 
purpose CPU accounting can be inaccurate, but never 
conceive this can be somehow maliciously exploited (see 
Section 2.3). Recent trends in mainstream schedulers 
render a discussion of the attack especially relevant.


1 Introduction


An attacker can be defined as one that aspires to per- 
form actions “resulting [in the] violation of the explicit 
or implicit security policy of a system”, which if suc- 
cessful, constitute a breach [31]. Under this definition, 
the said actions may be divided into two classes. One is 
of hostile actions, e.g. unlawful reading of sensitive data, 
spamming, launching of DDoS attacks, etc. The other is
of concealment actions. These are meant to prevent the 
hostile actions from being discovered, in an effort to pro- 
long the duration in which the compromised machine can 
be used for hostile purposes. While not hostile, conceal- 
ment actions fall under the above definitions of “attack” 
and “breach”, as they are in violation of any reasonable 
security policy. 

The “cheat” attack we describe embodies both a hostile
and a concealment aspect. In a nutshell, the attack
makes it possible to implement a cheat utility such that invoking


cheat p prog 


would run the program prog in such a way that it is allo- 
cated exactly p percent of the CPU cycles. The hostile as- 
pect is that p can be arbitrarily big (e.g. 95%), but prog 
would still get that many cycles, regardless of the pres- 
ence of competing applications and the fairness policy of 
the system. The concealment aspect is that prog would 
erroneously appear as consuming 0% CPU in monitor- 
ing tools like ps, top, xosview, etc. In other words, the 
cheat attack allows a program to (1) consume CPU cy- 
cles in a secretive manner, and (2) consume as many of 
these as it wants. This is similar to the common secu- 
rity breach scenario where an attacker manages to obtain 
superuser privileges on the compromised machine, and 
uses these privileges to engage in hostile activities and to 
conceal them. But in contrast to this common scenario, 
the cheat attack requires no special privileges. Rather, it
can be launched by regular users, rendering this important
line of defense (the need to obtain root or superuser privileges)
irrelevant, and making the job of the attacker significantly
easier.

Concealment actions are typically associated with
rootkits, consisting of “a set of programs and code that 
allows a permanent or consistent, undetectable presence 
on a computer” [25]. After breaking into a computer 
and obtaining root access, the intruder installs a rootkit 
to maintain such access, hide traces of it, and exploit 
it. Thus, ordinarily, the ability to perform concealment 
actions (the rootkit) is the result of a hostile action (the 
break-in). In contrast, with the cheat attack, it is exactly 
the opposite: the concealment action (the ability to ap- 
pear as consuming 0% CPU) is actually what makes it 
possible to perform the hostile action (of monopolizing 
the CPU regardless of the system’s fairness policy). We 
therefore begin by introducing the OS mechanism that 
allows a non-privileged application to conceal the fact it 
is using the CPU. 


1.1 Operating System Ticks 


A general-purpose operating system (GPOS) typically 
maintains control by using periodic clock interrupts. 
This practice started in the 1960s [12] and has continued
ever since, such that nowadays it is used by most
contemporary GPOSs, including Linux, the BSD fam- 
ily, Solaris, AIX, HPUX, IRIX, and the Windows family. 
Roughly speaking, the way the mechanism works is that 
at boot-time the kernel sets a hardware clock to generate
periodic interrupts at fixed intervals (every few milliseconds;
anywhere between 1ms and 15ms, depending
on the OS). The time instance at which the interrupt fires 
is called a tick, and the elapsed time between two consec- 
utive ticks is called a tick duration. The interrupt invokes 
a kernel routine, called the tick handler that is responsi- 
ble for various OS activities, of which the following are 
relevant for our purposes: 


1. Delivering timing services and alarm signals. For 
example, a movie player that wants to wake up on
time to display the next frame, requests the OS (us- 
ing a system call) to wake it up at the designated 
time. The kernel places this request in an internal 
data structure that is checked upon each tick. When 
the tick handler discovers there’s an expired alarm, 
it wakes the associated player up. The player then 
displays the frame and the scenario is repeated until 
the movie ends. 


2. Accounting for CPU usage by recording that the 
currently running process S consumed CPU cycles
during the last tick. Specifically, on every tick, S 
is stopped, the tick-handler is started, and the ker- 
nel increments S’s CPU-consumption tick-counter 
within its internal data structure. 


3. Initiating involuntary preemption and thereby
implementing multitasking (interleaving of the
CPU between several programs to create the illusion
they execute concurrently). Specifically, after
S is billed for consuming CPU during the last tick,
the tick handler checks whether S has exhausted its
“quantum”, and if so, S is preempted in favor of
another process. Otherwise, it is resumed.
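The three tick-handler duties listed above can be summarized in a short C model. This is only a user-space sketch under stated assumptions, not code from any real kernel: the names struct proc, tick_handler, and QUANTUM_TICKS, as well as the quantum length of 10 ticks, are hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process record; the field names are illustrative only. */
struct proc {
    unsigned long billed_ticks;  /* item 2: CPU-consumption tick counter  */
    unsigned int  quantum_left;  /* item 3: ticks left in current quantum */
    unsigned long wakeup_tick;   /* item 1: pending alarm (0 means none)  */
    int           runnable;      /* nonzero once the process may run      */
};

enum { QUANTUM_TICKS = 10 };     /* assumed quantum length, in ticks */

/*
 * Model of a periodic tick handler. `now` is the index of the tick that
 * just fired and `current` is the process that was running at that moment.
 * Returns 1 if `current` should be preempted, 0 if it may resume.
 */
int tick_handler(unsigned long now, struct proc *current,
                 struct proc *procs, size_t nprocs)
{
    size_t i;

    assert(current != NULL && procs != NULL);

    /* Item 1: wake every process whose alarm has expired. */
    for (i = 0; i < nprocs; i++) {
        if (procs[i].wakeup_tick != 0 && procs[i].wakeup_tick <= now) {
            procs[i].wakeup_tick = 0;
            procs[i].runnable = 1;
        }
    }

    /* Item 2: bill the whole tick to whoever was running when it fired. */
    current->billed_ticks++;

    /* Item 3: preempt once the quantum is used up. */
    if (current->quantum_left > 0)
        current->quantum_left--;
    if (current->quantum_left == 0) {
        current->quantum_left = QUANTUM_TICKS;  /* refill for the next turn */
        return 1;
    }
    return 0;
}
```

Note that in this model a process is billed only if it happens to be `current` at the instant a tick fires; CPU time consumed strictly between two ticks leaves no trace in `billed_ticks`.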


1.2 The Concealment Component 


The fundamental vulnerability of the tick mechanism lies 
within the second item above: CPU billing is based on 
periodic sampling. Consequently, if S can somehow 
manage to arrange things such that it always starts to run 
just after the clock tick, and always goes to sleep just be- 
fore the next one, then S will never be billed. One might 
naively expect this would not be a problem because ap- 
plications cannot request timing services independent of 
OS ticks. Indeed, it is technically impossible for non- 
privileged applications to request the OS to deliver alarm 
signals in between ticks. Nevertheless, we will show that 
there are several ways to circumvent this difficulty. 

To make things even worse, the cheat attack leads to 
misaccounting, where another process is billed for CPU 
time used by the cheating process. This happens because 
billing is done in tick units, and so whichever process 
happens to run while the tick takes place is billed for 
the entire tick duration, even if it only consumed a small 
fraction of it. As a result, even if the system administra- 
tors suspect something, they will suspect the wrong pro- 
cesses. If a cheating process is not visible through system 
monitoring tools, the only way to notice the attack is by 
its effect on throughput. The cheater can further disguise 
its tracks by moderating the amount of CPU it uses so as 
not to have too great an impact on system performance. 
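The misaccounting described in this subsection can be illustrated with a small simulation. The sketch below assumes a 4 ms tick and an idealized cheater that always runs at the start of a tick interval and sleeps before the tick fires; the names struct usage, simulate, and TICK_US are hypothetical, and the model ignores context-switch costs.

```c
#include <assert.h>

enum { TICK_US = 4000 };   /* assumed tick duration: 4 ms */

struct usage {
    long actual_us;        /* CPU time really consumed                  */
    long billed_us;        /* CPU time charged by the sampling handler  */
};

/*
 * Simulate `nticks` tick intervals. In every interval the cheater runs
 * for the first `cheat_us` microseconds and then sleeps; the victim runs
 * for the remainder and is therefore the process found running when the
 * tick fires.
 */
void simulate(long nticks, long cheat_us,
              struct usage *cheater, struct usage *victim)
{
    long t;

    assert(cheat_us >= 0 && cheat_us < TICK_US);
    cheater->actual_us = cheater->billed_us = 0;
    victim->actual_us = victim->billed_us = 0;

    for (t = 0; t < nticks; t++) {
        cheater->actual_us += cheat_us;
        victim->actual_us  += TICK_US - cheat_us;
        /* Sampling: whoever runs at the tick instant pays for the whole tick. */
        victim->billed_us  += TICK_US;
    }
}
```

Calling `simulate(250, 3000, &c, &v)` models one second in which the cheater takes 75% of every tick: the cheater accumulates 750,000 microseconds of actual CPU time but is billed nothing, while the victim is billed for the full second.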


1.3 The Hostile Component


The most basic defense one has against malicious pro- 
grams is knowing what’s going on in the system. Thus, a 
situation in which a non-privileged application can con- 
ceal the fact it makes use of the CPU, constitutes a seri- 
ous security problem in its own right. However, there is 
significantly more to cheat attacks than concealment, be- 
cause CPU accounting is not conducted just for the sake 
of knowing what’s going on. Rather, this information has 
a crucial impact on scheduling decisions. 

As exemplified in Section 5, the traditional design 
principle underlying general-purpose scheduling (as op- 
posed to research or special-purpose schemes) is the 
same: the more CPU cycles used by a process, the lower 
its priority becomes [15]. This negative feedback (run- 
ning reduces priority to run more) ensures that (1) all 
processes get a fair share of the CPU, and that (2) processes
that do not use the CPU very much (such as
I/O bound processes) enjoy a higher priority for those
bursts in which they want it. In fact, the latter is largely 
what makes text editors responsive to our keystrokes in 
an overloaded system [14]. 

The practical meaning of this is that by consistently 
appearing to consume 0% CPU, an application gains a 
very high priority. As a consequence, when a cheating 
process wakes up and becomes runnable (following the 
scenario depicted in the previous subsection) it usually 
has a higher priority than that of the currently running 
process, which is therefore immediately preempted in fa- 
vor of the cheater. Thus, as argued above, unprivileged 
concealment capabilities indeed allow an application to 
monopolize the CPU. However, surprisingly, this is not 
the whole story. It turns out that even without conceal- 
ment capabilities it is still sometimes possible for an ap- 
plication to dominate the CPU without superuser privi- 
leges, as discussed next. 
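The negative feedback discussed in this subsection can be sketched with a generic decay-usage rule. The formula below (halve the recorded usage each period, then add the newly billed ticks) is an assumption made for illustration and is not the priority formula of any particular operating system; decay_usage and pick_next are hypothetical names.

```c
#include <assert.h>

/* Generic decay-usage rule (illustrative assumption): old usage is halved
 * every scheduling period and the ticks billed in that period are added. */
unsigned decay_usage(unsigned usage, unsigned billed_ticks)
{
    return usage / 2 + billed_ticks;
}

/* The process with the lowest recorded usage has the highest priority.
 * Returns its index; ties go to the lowest index. */
int pick_next(const unsigned *usage, int n)
{
    int i, best = 0;

    assert(n > 0);
    for (i = 1; i < n; i++)
        if (usage[i] < usage[best])
            best = i;
    return best;
}
```

Under this rule, an honest process billed 10 ticks per period settles at a usage value close to 20, whereas a process that is never billed stays at 0 and is therefore always selected first, regardless of how many cycles it really consumed.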


1.4 The Interactivity Component and the 
Spectrum of Vulnerability to Cheating 


Not all GPOSs are vulnerable to cheat attacks to the same 
degree. To demonstrate, let us first compare
Linux-2.4 and Linux-2.6. One of the radical differences
between the two is the scheduling subsystem, which has 
been redesigned from scratch and undergone a complete 
rewrite. A major design goal of the new scheduler was to 
improve users’ experience by attempting to better iden- 
tify and service interactive processes. In fact, the lead 
developer of this subsystem argued that “the improve- 
ment in the way interactive tasks are handled is actually 
the change that should be the most noticeable for ordi- 
nary users” [3]. Unfortunately, with this improvement 
also came increased vulnerability to cheat attacks. 

In Linux-2.6, a process need not conceal the fact it is 
using the CPU in order to monopolize it. Instead, it can 
masquerade as being “interactive”, a concept that is tied 
within Linux-2.6 to the number of times the process vol- 
untarily sleeps [32]. Full details are given in Section 6, 
but in a nutshell, to our surprise, even after we introduced 
cycle-accurate CPU accounting to the Linux-2.6 kernel 
and made the cheating process fully “visible” at all times, 
the cheater still managed to monopolize the CPU. The 
reason turned out to be the cheater’s many short voluntary
sleep-periods while clock ticks take place (as specified
in Section 1.2). This, along with Linux-2.6’s aggressive
preference for “interactive” processes, yielded the
new weakness. 

In contrast, the interactivity weakness is not present 
in Linux-2.4, because priorities do not reflect any con- 
siderations that undermine the aforementioned negative 
feedback. Specifically, the time remaining until a process 
exhausts its allocated quantum also serves as its priority,
and so the negative feedback is strictly enforced [36]. Indeed,
having Linux-2.4 use accurate accounting information
defeats the cheat attack.

The case of Linux 2.4 and 2.6 is not an isolated inci- 
dent. It is analogous to the case of FreeBSD and the two 
schedulers it makes available to its users. The default 
“4BSD” scheduler [5] is vulnerable to cheat attacks due 
to the sampling nature of CPU accounting, like Linux- 
2.4. The newer “ULE” scheduler [42] (designated to re- 
place 4BSD) attempts to improve the service provided 
to interactive processes, and likewise introduces an addi- 
tional weakness that is similar to that of Linux-2.6. We 
conclude that there’s a genuine (and much needed) intent 
to make GPOSs do a better job in adequately supporting 
newer workloads consisting of modern interactive appli- 
cations such as movie players and games, but that this 
issue is quite subtle and prone to errors compromising 
the system (see Section 5.2 for further discussion of why 
this is the case). 

Continuing to survey the OS spectrum, Solaris repre- 
sents a different kind of vulnerability to cheat attacks. 
This OS maintains completely accurate CPU account- 
ing (which is not based on sampling) and does not suffer 
from the interactivity weakness that is present in Linux- 
2.6 and FreeBSD/ULE. Surprisingly, despite this con- 
figuration, it is still vulnerable to the hostile component 
of cheating. The reason is that, while accurate informa- 
tion is maintained by the kernel, the scheduling subsys- 
tem does not make use of it (!). Instead, it utilizes the 
sampling-based information gathered by the periodic tick 
handler [35]. This would have been acceptable if all ap- 
plications “played by the rules” (in which case periodic 
sampling works quite well), but such an assumption is 
of course not justified. The fact that the developers of 
the scheduling subsystems did not replace the sampled 
information with the accurate one, despite its availability,
serves as a testament to their lack of awareness of the
possibility of cheat attacks.

Similarly to Solaris, Windows XP maintains accurate 
accounting that is unused by the scheduler, which main- 
tains its own sample-based statistics. But in contrast to 
Solaris, XP also suffers from the interactivity weakness 
of Linux 2.6 and ULE. Thus, utilizing the accurate infor- 
mation would have had virtually no effect. 

Of the seven OS/scheduler pairs we have examined,
only Mac OS X was found to be immune to the cheat
attack. The reason for this exception, however, is not a 
better design of the tick mechanism so as to avoid the at- 
tack. Rather, it is because Mac OS X uses a different tim- 
ing mechanism altogether. Similarly to several realtime 
OSs, Mac OS X uses one-shot timers to drive its timing 
and alarm events [29, 47, 20]. These are hardware inter- 
rupts that are set to go off only for specific needs, rather 
than periodically. With this design, the OS maintains an
ascending list of outstanding timers and sets a one-shot
event to fire only when it is time to process the event at
the head of the list; when this occurs, the head is popped,
and a new one-shot event is set according to the new
head. This design is motivated by various benefits, such
as reduced power consumption in mobile devices [40],
better alarm resolution [15], and less OS “noise” [53].
However, it causes the time period between two consecutive
timer events to be variable and unbounded. As a
consequence, CPU-accounting based on sampling is no
longer a viable option, and the Mac OS X immunity to
cheat attacks is merely a side effect of this. Our findings
regarding the spectrum of vulnerabilities to cheat attacks
are summarized in Fig. 1.


Figure 1: Classification of major operating systems in terms of features relevant for the cheat attack. General-purpose OSs are
typically tick-based, in which case they invariably use sampling and are therefore vulnerable to cheat attacks to various degrees.
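The one-shot timer design described above (an ascending list of outstanding timers, with the hardware armed only for the head of the list) can be sketched as follows. This is a minimal illustration, not Mac OS X code: struct timer_queue, add_timer, on_fire, and the fixed capacity MAX_TIMERS are hypothetical, expiry times are assumed to be nonzero, and the `armed_for` field merely stands in for programming a real hardware timer.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_TIMERS = 16 };              /* assumed fixed capacity */

struct timer_queue {
    unsigned long expiry[MAX_TIMERS];  /* outstanding timers, ascending    */
    size_t        count;
    unsigned long armed_for;           /* time the one-shot event is set   */
                                       /* for; 0 means nothing is armed    */
};

/* Program the (modeled) one-shot event for the head of the list. */
static void rearm(struct timer_queue *q)
{
    q->armed_for = q->count ? q->expiry[0] : 0;
}

/* Insert a timer, keeping the list sorted. Returns 0, or -1 if full. */
int add_timer(struct timer_queue *q, unsigned long when)
{
    size_t i;

    if (q->count == MAX_TIMERS)
        return -1;
    i = q->count;
    while (i > 0 && q->expiry[i - 1] > when) {  /* shift later timers up */
        q->expiry[i] = q->expiry[i - 1];
        i--;
    }
    q->expiry[i] = when;
    q->count++;
    rearm(q);
    return 0;
}

/* Called when the one-shot event fires: pop the head and re-arm for the
 * new head. Returns the expiry time of the timer that was popped. */
unsigned long on_fire(struct timer_queue *q)
{
    unsigned long head;
    size_t i;

    assert(q->count > 0);
    head = q->expiry[0];
    for (i = 1; i < q->count; i++)
        q->expiry[i - 1] = q->expiry[i];
    q->count--;
    rearm(q);
    return head;
}
```

Because interrupts occur only at the requested expiry times, the gap between consecutive events depends entirely on what was requested, which is why a fixed-period sampling scheme has nothing to hook into under this design.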

While it is possible to rewrite a tick-based OS to be 
one-shot, this is a non-trivial task requiring a radical 
change in the kernel (e.g. the Linux-2.6.16 kernel source 
tree contains 8,997 occurrences of the tick frequency HZ 
macro, spanning 3,199 files). Worse, ticks have been 
around for so long, that some user code came to directly 
rely on them [52]. Luckily, eliminating the threat of 
cheat attacks does not necessitate a radical change: there 
exists a much simpler solution (Section 6). Regardless, 
the root cause of the problem is not implementation dif- 
ficulties, but rather, lack of awareness. 


1.5 Roadmap 


This paper is structured as follows. Section 2 places the 
cheat attack within the related context and discusses the 
potential exploits. Section 3 describes in detail how to 
implement a cheating process and experimentally evalu- 
ates this design. Section 4 further shows how to apply 
the cheating technique to arbitrary applications, turning 
them into “cheaters” without changing their source code. 
Section 5 provides more details on contemporary sched- 
ulers and highlights their weaknesses in relation to the 
cheat attack on an individual basis. Section 6 describes 
and evaluates our solution to the problem, and Section 7 
concludes. 


2 Potential Exploits and Related Work 


2.1 The Privileges-Conflict Axis 


The conflict between attackers and defenders often re- 
volves around privileges of using resources, notably net- 
work, storage, and the CPU. The most aggressive and 
general manifestation of this conflict is attackers that as- 
pire to have all privileges and avoid all restrictions by ob- 
taining root/administrator access. Once obtained, attack- 
ers can make use of all the resources of the compromised 
machine in an uncontrolled manner. Furthermore, using 
rootkits, they can do so secretly in order to avoid detec- 
tion and lengthen the period in which the resources can 
be exploited. Initially, rootkits simply replaced various 
system programs, such as netstat to conceal network activity,
ls to conceal files, and ps/top to conceal processes
and CPU usage [55]. But later rootkits migrated into the 
kernel [9, 46] and underneath it [27], reflecting the rapid 
escalation of the concealment/detection battle. 

At the other end of the privileges conflict one can find 
attacks that are more subtle and limited in nature. For 
example, in order to take control over a single JVM in- 
stance running on a machine to which an attacker has 
no physical access, Govindavajhala and Appel suggest 
the attacker should “convince it [the machine] to run the 
[Java] program and then wait for a cosmic ray (or other 
natural source) to induce a memory error”; they then
show that “a single bit error in the Java program’s data 
space can be exploited to execute arbitrary code with a 
probability of about 70%” within the JVM instance [21]. 
When successful, this would provide the attacker with 
the privileges of the user that spawned the JVM. 

When positioning the general vs. limited attacks at op- 
posite ends of the privileges-conflict “axis”, the cheat at- 
tack is located somewhere in between. It is certainly not 
as powerful as having root access and a rootkit, e.g. the 
attacker cannot manipulate and hide network activity or 
file usage. On the other hand, the attack is not limited to 
only one user application, written in a specific language,
on the condition of a low probability event such as a cosmic
ray flipping an appropriate bit. Instead, at its fullest,
the cheat attack offers non-privileged users one generic 
functionality of a rootkit: A ubiquitous way to control, 
manipulate, and exploit one computer resource — CPU 
cycles — in a fairly secretive manner. In this respect, 
cheating is analogous to attacks like the one suggested 
by Borisov et al. that have shown how to circumvent 
the restrictions imposed by file permissions in a fairly 
robust way [8]. As with cheating, non-privileged users 
are offered a generic functionality of rootkits, only this 
time concerning files. An important difference, how- 
ever, is that Borisov’s attack necessitates the presence of 
a root setuid program that uses the access/open idiom 
(a widely discouraged practice [11]¹), whereas our attack
has no requirements but running under a ticking OS. 


2.2 Denying or Using the Hijacked Cycles 


Cheating can obviously be used for launching DoS at- 
tacks. Since attackers can hijack any amount of CPU 
cycles, they can run a program that uselessly consumes 
e.g. 25%, 50%, or 75% of each tick’s cycles, depend- 
ing on the extent to which they want to degrade the ef- 
fective throughput of the system; and with concealment 
capabilities, users may feel that things work slower, but 
would be unable to say why. This is similar to “shrew” 
and “RoQ” (Reduction of Quality) attacks that take ad- 
vantage of the fact that TCP interprets packet loss as an 
indication of congestion and halves a connection’s trans- 
mission rate in response. With well-timed low-rate DoS 
traffic patterns, these attacks can throttle TCP flows to 
a small fraction of their ideal rate while eluding detec- 
tion [28, 23, 50]. 

Another related concept is “parasitic computing”, with
which one machine forces another to solve a piece of a 
complex computational problem merely by sending to it 
malformed IP packets and observing the response [6]. 
Likewise, instead of just denying the hijacked cycles 
from other applications, a cheating process can leverage 
them to engage in actual computation (but in contrast, it 
can do so effectively, whereas parasitic computing is ex- 
tremely inefficient). Indeed, Section 4 demonstrates how 
we secretly monopolized an entire departmental shared 
cluster for our own computational needs, without “doing 
anything wrong”. 

1 The access system call was designed to be used by setuid root
programs to check whether the invoking user has appropriate permissions,
before opening a respective file. This induces a time-of-check-to-time-of-use
(TOCTTOU) race condition whereby an adversary can
make a name refer to a different file after the access and before the
open. Thus, its manual page states that “the access system call is
a potential security hole due to race conditions and should never be
used” [1].


A serious exploit would occur if a cheat application
was spread using a computer virus or worm. This potential
development is very worrying, as it foreshadows
a new type of exploit for computer viruses. So far, com- 
puter viruses targeting the whole Internet have been used 
mainly for launching DDoS attacks or spam email [34]. 
In many cases these viruses and worms were found and 
uprooted because of their very success, as the load they
place on the Internet becomes impossible to ignore [38]. But Staniford
et al. described a “surreptitious” attack by which a
worm that requires no special privileges can spread in 
a much harder to detect contagion fashion, without exhibiting
peculiar communication patterns, potentially infecting
upwards of 10,000,000 hosts [49]. Combining
such worms with our cheat attack can be used to cre- 
ate a concealed ad-hoc supercomputer and run a compu- 
tational payload on massive resources in minimal time, 
harvesting a huge infrastructure similar to that amassed 
by projects like SETI@home [2]. Possible applications 
include cracking encryptions in a matter of hours or days, 
running nuclear simulations, and illegally executing a 
wide range of otherwise regulated computations. While 
this can be done with real rootkits, the fact it can also po- 
tentially be done without ever requiring superuser privi- 
leges on the subverted machines is further alarming. In- 
deed, with methods like Borisov’s (circumvent file per- 
missions [8]), Staniford’s (networked undetected conta- 
gion [49]), and ours, one can envision a kind of “rootkit 
without root privileges”. 


2.3 The Novelty of Cheating


While the cheat attack is simple, to our knowledge, there 
is no published record of it, nor any mention of it on the 
web. Related publications point out that general-purpose 
CPU accounting might be inaccurate, but never raise the 
possibility that this can be maliciously exploited. Our 
first encounter with the attack was, interestingly, when it 
occurred by chance. While investigating the effect of dif- 
ferent tick frequencies [15], we observed that an X server 
servicing a Xine movie player was only billed for 2% of 
the cycles it actually consumed, a result of (1) X starting 
to run just after a tick (following Xine’s repetitive alarm 
signals to display each subsequent frame, which are de- 
livered by the tick handler), and (2) X finishing the work 
within 0.8 of a tick duration. This pathology in fact out- 
lined the cheat attack principles. But at the time, we did 
not realize that this can be maliciously done on purpose. 

We were not alone: There have been others that were 
aware of the accounting problem, but failed to realize the 
consequences. Liedtke argued that the system/user time- 
used statistics, as e.g. provided by the getrusage system 
call, might be inaccurate “when short active intervals are 
timer-scheduled, i.e. start always directly after a clock 
interrupt and terminate before the next one” [30] (exactly
describing the behavior we observed, but stopping short
of recognizing that this can be exploited).

The designers of the FreeBSD kernel were also aware 
this might occur, contending that “since processes tend 
to synchronize to ’tick’, the statistics clock needs to be 
independent to ensure that CPU utilization is correctly 
accounted” [26]. Indeed, FreeBSD performs the billing 
activity (second item in Section 1.1) independently of the 
other tick activities (notably timing), at different times 
and at a different frequency. But while this design alleviates
some of the concerns raised by Liedtke [30] and
largely eliminates the behavior we observed [15], it is 
nonetheless helpless against a cheat attack that factors 
this design in (Section 5) and only highlights the lack of
awareness of the possibility of systematic cheating.

Solaris designers noticed that “CPU usage measure- 
ments aren’t as accurate as you may think ... especially 
at low usage levels”, namely, a process that consumes lit- 
tle CPU “could be sneaking a bite of CPU time whenever 
the clock interrupt isn’t looking” and thus “appear to use 
1% of the system but in fact use 5%” [10]. The billing 
error was shown to match the inverse of the CPU utiliza- 
tion (which is obviously not the case when cheating, as 
CPU utilization and the billing error are in fact equal). 

Windows XP employs a “partial decay” mechanism, 
proclaiming that without it “it would be possible for 
threads never to have their quantums reduced; for ex- 
ample, a thread ran, entered a wait state, ran again, and 
entered another wait state but was never the currently 
running thread when the clock interval timer fired” [44]. 
Like in the FreeBSD case, partial decay is useless against 
a cheat attack (Section 5), but in contrast, it doesn’t even 
need to be specifically addressed, reemphasizing the elu- 
siveness of the problem. 

We contend, however, that all the above can be considered
as anecdotal evidence of the absence of awareness of
cheat attacks, considering the bottom line, which is that
all widely used ticking operating systems are susceptible
to the attack, and have been that way for years.²


3 Implementation and Evaluation 


As outlined above, the cheat attack exploits the combina- 
tion of two operating system mechanisms: periodic sam- 
pling to account for CPU usage, and prioritization of pro- 
cesses that use less of the CPU. The idea is to avoid the 
accounting and then enjoy the resulting high priority. We 
next detail how the former is achieved. 


2 We conceived the attack a few years after [15], as a result of a dispute
between PhD students regarding who gets to use the departmental
compute clusters for simulations before some approaching deadlines. 
We eventually did not exercise the attack to resolve the dispute, ex- 
cept for the experiment described in Section 4.1, which was properly 
authorized by the system personnel.


Figure 2: The cheat attack is based on a scenario where a process
starts running immediately after one clock tick, but stops
before the next tick, so as not to be billed. 


3.1 Using the CPU Without Being Billed 


When a tick (= periodic hardware clock interrupt) occurs, 
the entire interval since the previous tick is billed to the 
application that ran just before the current tick occurred. 
This mechanism usually provides reasonably accurate 
billing, despite the coarse tick granularity of a few milliseconds
and the fact that nowadays the typical quantum
is much shorter for many applications [15].³ This is a
result of the probabilistic nature of the sampling: Since 
a large portion of the quanta are shorter than one clock 
tick, and the scheduler can only count in complete tick 
units, many of the quanta are not billed at all. But when 
a short quantum does happen to include a clock interrupt, 
the associated application is overbilled and charged a full 
tick. Hence, on average, these two effects tend to cancel 
out, because the probability that a quantum includes a 
tick is proportional to its duration. 
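The averaging argument above can be checked by brute force. The sketch below assumes integer time units, ticks firing at multiples of T, and a quantum of length q (with 1 <= q <= T) that starts at a uniformly distributed offset within a tick period; the function name placements_billed is hypothetical.

```c
#include <assert.h>

/*
 * Count, over all T possible integer start offsets s in [0, T), how many
 * placements of a quantum occupying [s, s + q) contain a tick instant
 * (a multiple of T) and are therefore billed one full tick.
 */
long placements_billed(long T, long q)
{
    long s, hits = 0;

    assert(T > 0 && q >= 1 && q <= T);
    for (s = 0; s < T; s++) {
        /* First tick instant at or after the start of the quantum. */
        long next_tick = ((s + T - 1) / T) * T;
        if (next_tick < s + q)
            hits++;
    }
    return hits;
}
```

Exactly q of the T placements are billed, so a quantum of length q is charged a full tick (T units) with probability q/T, giving an expected bill of q units, which equals its true consumption. The argument depends on the start offset being uniformly distributed; a process that always starts immediately after a tick and stops before the next one is never among the billed placements.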

Fig. 2 outlines how this rationale is circumvented. The 
depicted scenario has two components: (1) start running 
after a given billing sample, and (2) stop before the next. 
Implementing the first component is relatively easy, as 
both the billing and the firing of pending alarm timers are 
done upon a tick (first and second items in Section 1.1; 
handling the situation where the two items are indepen- 
dent, as in FreeBSD, is deferred to Section 5). Con- 
sequently, if a process blocks on a timer, it will be re- 
leased just after a billing sample. And in particular, set- 
ting a very short timer interval (e.g. zero nanoseconds) 
will wake a process up immediately after the very next 
tick. If, in addition, it has high priority, as is the case
when the OS believes it is consistently sleeping, it will
also start to run.

The harder part is to stop running before the next tick, 
when the next billing sample occurs. This may happen 
by chance as described above in relation to Xine and X. 
The question is how to do this on purpose. Since the 
OS does not provide intra-tick timing services, the pro- 
cess needs some sort of a finer-grained alternative timing 


3 In this context, a quantum is defined to be the duration between the
time an application was allocated the CPU and the time at which it
relinquished the CPU, either voluntarily or due to preemption.
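    #include <time.h>                         // nanosleep, struct timespec

    typedef unsigned long long cycle_t;       // unsigned 64-bit integer
    static const struct timespec zero = {0, 0};   // zero-length sleep

    inline cycle_t get_cycles()
    {
        cycle_t ret;
        asm volatile("rdtsc" : "=A" (ret));   // 32-bit x86: EDX:EAX
        return ret;
    }

    cycle_t cycles_per_tick()
    {
        nanosleep(&zero, 0);                  // sync with tick
        cycle_t start = get_cycles();
        for (int i = 0; i < 1000; i++)
            nanosleep(&zero, 0);
        return (get_cycles() - start) / 1000;
    }

    void cheat_attack(double fraction)
    {
        cycle_t work, tick_start, now;
        work = fraction * cycles_per_tick();
        nanosleep(&zero, 0);                  // sync with tick
        tick_start = get_cycles();
        while (1) {
            now = get_cycles();
            if (now - tick_start >= work) {
                nanosleep(&zero, 0);          // avoid bill
                tick_start = get_cycles();
            }
            // do some short work here...
        }
    }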


Figure 3: The complete code for the cheater process (cycle_t is typedef-ed to be an unsigned 64-bit integer). 


mechanism. This can be constructed with the help of 
the cycle counter, available in all contemporary architec- 
tures. The counter is readable from user level using an 
appropriate assembly instruction, as in the get_cycles 
function (Fig. 3, top/left) for the Pentium architecture. 


The next step is figuring out the interval between each 
two consecutive clock ticks, in cycles. This can be done 
by a routine such as cycles_per_tick (Fig. 3 bottom/left), 
correctly assuming a zero sleep would wake it up at the 
next clock interrupt, and averaging the duration of a thou- 
sand ticks. While this was sufficient for our purposes, a 
more precise method would be to tabulate all thousand 
timestamps individually, calculate the intervals between 
them, and exclude outliers that indicate some activity in- 
terfered with the measurement. Alternatively, the data 
can be deduced from various OS-specific information 
sources, e.g. by observing Linux’s /proc/interrupts file 
(reveals the OS tick frequency) and /proc/cpuinfo (pro- 
cessor frequency). 


It is now possible to write an application that uses any 
desired fraction of the available CPU cycles, as in the 
cheat_attack function (Fig. 3, right). This first calcu- 
lates the number of clock cycles that constitute the de- 
sired percentage of the clock tick interval. It then iter- 
ates doing its computation, while checking whether the 
desired limit has been reached at each iteration. When 
the limit is reached, the application goes to sleep for 
zero time, blocking till after the next tick. The only 
assumption is that the computation can be broken into 
small pieces, which is technically always possible to do 
(though in Section 4 we further show how to cheat with- 
out this assumption). This solves the problem of know- 
ing when to stop to avoid being billed. As a result, this 
non-privileged application can commandeer any desired 
percentage of the CPU resources, while looking as if it is 
using zero resources. 
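
The arithmetic behind the cycle budget can be sketched as follows. This is an illustration only: the helper names cycles_per_tick_from_freq and work_cycles are hypothetical, and the sketch assumes the two frequencies are already known (for example from the OS-specific sources mentioned above) rather than measured.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two consecutive ticks, given the processor frequency
 * and the OS tick frequency, both in Hz. */
uint64_t cycles_per_tick_from_freq(uint64_t cpu_hz, uint64_t tick_hz)
{
    assert(tick_hz > 0);
    return cpu_hz / tick_hz;
}

/* Cycle budget for running during `percent` of every tick interval. */
uint64_t work_cycles(uint64_t cycles_per_tick, unsigned percent)
{
    assert(percent <= 100);
    return cycles_per_tick * percent / 100;
}
```

For a 2.8GHz processor and a 250Hz tick, cycles_per_tick_from_freq(2800000000ULL, 250) gives 11,200,000 cycles per tick, and an 80% budget is work_cycles(11200000ULL, 80) = 8,960,000 cycles, which corresponds to 3.2ms of each 4ms tick.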


3.2 Experimental Results 


To demonstrate that this indeed works as described, we 
implemented such an application and ran it on a 2.8GHz 
Pentium-IV, running a standard Linux-2.6.16 system de- 
fault installation with the usual daemons, and no other 
user processes except our tests. The application didn’t 
do any useful work — it just burned cycles. At the 
same time we also ran another compute-bound applica- 
tion, that also just burned cycles. An equitable scheduler 
should have given each about 50% of the machine. But 
the cheat application was set to use 80%, and got them. 


During the execution of the two competing applica- 
tions, we monitored every important event in the system 
(such as interrupts and context switches) using the Klog- 
ger tool [17]. A detailed rendition of precisely what hap- 
pened is given in Fig. 4. This shows 10 seconds of exe- 
cution along the X axis, at tick resolution. As the system 
default tick rate is 250 Hz, each tick represents 4ms. To 
show what happens during each tick, we spread those 
4ms along the Y axis, and use color coding. Evidently, 
the cheat application is nearly always the first one to run 
(on rare occasions some system daemon runs initially for 
a short time). But after 3.2ms (that is, exactly 80% of the 
tick) it blocks, allowing the honest process or some other 
process to run. 

Fig. 5 scatter-plots the billing accuracy, where each 
point represents one quantum. With accurate accounting 
we would have seen a diagonal, but this is not the case. 
While the cheat process runs for just over 3ms each time, 
it is billed for 0 (bottom right disk). The honest process, 
on the other hand, typically runs for less than 1ms, but is
billed for 4 (top left); on rare occasions it runs for nearly 
a whole tick, due to some interference that caused the 
cheater to miss one tick (top right); the cheater neverthe- 
less recovers at the following tick. The other processes 
run for a very short time and are never billed. 
Figure 4: Timeline of 10 seconds of competition between a cheating
process and an honest process. Legends give the distribution of CPU cycles.
Figure 5: Billing accuracy achieved during the
test shown in Fig. 4. 
Figure 6: Snippet of the output of the top utility for user dants (the full output includes dozens of processes, and the cheater
appears near the end and is hard to notice). The honest process is billed for 99.3% of the CPU, while actually getting only 20%. 
The cheater looks as if it is not getting any CPU, while it actually consumes 80%. 


The poor accounting information propagates to system 
usage monitoring tools. Like any monitoring utility, the 
view presented to the user by top is based on OS billing 
information, and presents a completely distorted picture 
as can be seen in Fig. 6. This dump was taken about 
8 seconds into the run, and indeed the honest process is 
billed for 99.3% of the CPU and is reported as having run 
for 7.79 seconds. The cheater on the other hand is shown 
as using 0 time and 0% of the CPU. Moreover, it is re- 
ported as being suspended (status S), further throwing off 
any system administrator that tries to understand what is 
going on. As a result of the billing errors, the cheater has 
the highest priority (lowest numerical value: 15), which 
allows it to continue with its exploits. 


Our demonstration used a setting of 80% for the cheat 
application (a 0.8 fraction argument to the cheat_attack 
function in Fig. 3). But other values can be used. Fig. 7 
shows that the attack is indeed very accurate, and can 
achieve precisely the desired level of usage. Thus, an at- 
tacker that wants to keep a low profile can set the cheat- 
ing to be a relatively small value (e.g. 15%); the chances 
users will notice this are probably very slim. 


Finally, our demonstration pitted only one competitor against
the cheater. But the attack is in fact successful regardless of
the number of competitors that form the background load. This is
demonstrated in Fig. 8: An honest process (left) gets its equal
share of the CPU, which naturally becomes smaller and smaller
as more competing processes are added. For example, when 5
processes are present, each gets 20%. In contrast, when the
process is cheating (right) it always gets what it wants, despite
the growing competition. The reason of course is that the cheater
has a very high priority, as it appears to consume no CPU cycles,
which implies immediate service upon wakeup.


4 Running Unmodified Applications 


A potential drawback of the above design is that it re- 
quires modifying the application to incorporate the cheat 
code. Ideally, from an attacker’s perspective, there 
should be a “cheat” utility such that invoking e.g. 


cheat application 


would execute the application as a 95%-cheater, without 
having to modify and recompile its code. This section 
describes two ways to implement such a tool. 
Figure 7: The attack is very accurate and
the cheater gets exactly the amount of CPU 
cycles it requested. 


4.1 Cheat Server 


The sole challenge a cheat application faces is obtaining 
a timing facility that is finer than the one provided by 
the native OS. Any such facility would allow the cheater 
to systematically block before ticks occur, ensuring it is
never billed and hence the success of the attack. In the 
previous section, this was obtained by subdividing the 
work into short chunks and consulting the cycle counter 
at the end of each. A possible alternative is obtaining 
the required service using an external machine, namely, 
a cheat server. 

The idea is very simple. Using a predetermined cheat 
protocol, a client opens a connection and requests the 
remote server to send it a message at some designated 
high-resolution time, before the next tick on the local 
host occurs. (The request is a single UDP packet spec- 
ifying the required interval in nanoseconds; the content 
of the response message is unimportant.) The client then 
polls the connection to see if a message arrived, instead 
of consulting the cycle counter. Upon the message ar- 
rival, the client as usual sleeps-zero, wakes up just after 
the next tick, sends another request to the server, and so 
on. The only requirement is that the server would indeed 
be able to provide the appropriate timing granularity. But 
this can be easily achieved if the server busy-waits on its 
cycle counter, or if its OS is compiled with a relatively 
high tick rate (the alternative we preferred). 

By switching the fine-grained timing source — from 
the cycle counter to the network — we gain one impor- 
tant advantage: instead of polling, we can now sleep-wait 
for the event to occur, e.g. by using the select system 
call. This allows us to divide the cheater into two separate
entities: the target application, which is the unmodified
program we want to run, and the cheat client,
which is aware of the cheat protocol, provisions the tar- 
get application, and makes sure it sleeps while ticks oc- 
cur. The client exercises its control by using the standard 
SIGSTOP/SIGCONT signals, as depicted in Fig. 9: 





[Figure 8 plot omitted; x axis: number of competing processes.]

Figure 8: Cheating is immune to the background load.

[Figure 9 diagram omitted: message sequence between the target program (fork/exec'ed by the client), the cheat client (default tick rate, 250Hz), and the cheat server (10,000Hz). In every tick the client wakes from its zero sleep, sends the server a request message (80%), sends SIGCONT to the program, and blocks in select; when the server sends its message, the client sends SIGSTOP to the program and sleeps zero again.]

Figure 9: The cheat protocol, as used by an 80%-cheater.


1. The client forks the target application, sends it a 
stop signal, and goes to sleep till the next tick. 


2. Awoken on a tick, the client does the following: 
(a) It sends the cheat server a request for a timing 
message including the desired interval. 


(b) It sends the target application a cont signal to 
wake it up. 

(c) It blocks-waiting for the message from the 
cheat server to arrive. 


3. As the cheat client blocks, the operating system will 
most probably dispatch the application that was just 
unblocked (because it looks as if it is always sleep- 
ing, and therefore has high priority). 


4. At the due time, the cheat server sends its message 
to the cheat client. This causes a network interrupt, 
and typically the immediate scheduling of the cheat 
client (which also looks like a chronic sleeper). 


5. The cheat client now does two things: 
(a) It sends the target application a stop signal to 
prevent it from being billed.


(b) It goes to sleep-zero, till the next (local) tick. 


6. Upon the next tick, it will resume from step 2. 
Figure 10: The combined throughput of honest vs. 60%-cheating
processes, as a function of the number of cluster nodes
used. On each node there are ten honest processes and one 
cheater running. The cheaters’ throughput indicates that the 
server simultaneously provides good service to all clients. 


To demonstrate that this works, we have implemented 
this scenario, hijacking a shared departmental cluster of 
Pentium-IV machines. As a timing server we used an old 
Pentium-III machine, with a Linux 2.6 system clocked 
at 10,000 Hz. While such a high clock rate adds over- 
head [15], this was acceptable since the timing server 
does not have any other responsibilities. In fact, it could 
easily generate the required timing messages for the full 
cluster size, which was 32 cheat clients in our case, as 
indicated by Fig. 10. 


4.2 Binary Instrumentation 


Using a cheat server is the least wasteful cheating method 
in terms of throughput loss, as it avoids all polling. The 
drawback however is the resulting network traffic that 
can be used to detect the attack, and the network latency 
which is now a factor to consider (observe the cheaters’ 
throughput in Fig. 10 that is higher than requested). Ad- 
ditionally, it either requires a dedicated machine to host 
the server (if it busy-waits to obtain the finer resolution) 
or the ability to install a new kernel (if resolution is ob- 
tained through higher tick rate). Finally, the server con- 
stitutes a single point of failure. 
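
The effect of network latency on the achieved fraction can be illustrated with a simple timing model. This is a sketch under stated assumptions, not a description of the measured system: it assumes the server starts timing when the request arrives and that the target keeps running until the reply reaches the client, so the target runs for the requested interval plus one network round trip. The names run_time_ns and overruns_tick and the latency values used below are hypothetical.

```c
#include <assert.h>

/* Time (ns) the target runs in each tick: it is resumed when the
 * request is sent and stopped only when the reply arrives, one round
 * trip after the requested interval has elapsed (assumed model). */
long run_time_ns(long requested_ns, long round_trip_ns)
{
    assert(requested_ns >= 0 && round_trip_ns >= 0);
    return requested_ns + round_trip_ns;
}

/* Nonzero if the target would still be running when the next tick
 * fires, in which case it would be billed for that tick. */
int overruns_tick(long requested_ns, long round_trip_ns, long tick_ns)
{
    assert(tick_ns > 0);
    return run_time_ns(requested_ns, round_trip_ns) >= tick_ns;
}
```

Under this model, a 60% request on a 4ms tick (2.4ms) with an assumed 0.2ms round trip lets the target run for 2.6ms, i.e. somewhat more than requested, which is the direction of the deviation noted for Fig. 10. The model also shows why the requested interval must leave headroom: if the interval plus the round trip reaches the tick length, the target is caught running by the tick.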

Binary instrumentation of the target application is 
therefore an attractive alternative, potentially providing 
a way to turn an arbitrary program into a cheater, requir- 
ing no recompilation and avoiding the drawbacks of the 
cheat-server design. The idea is to inject the cheating 
code directly into the executable, instead of explicitly in- 
cluding it in the source code. To quickly demonstrate the 
viability of this approach we used Pin, a dynamic binary 
instrumentation infrastructure from Intel [33], primarily 
used for analysis tasks such as profiling and performance evaluation.
Being dynamic, Pin is similar to a just-in-time
(JIT) compiler that allows attaching analysis routines to 
various pieces of the native machine code upon the first 
time they are executed. 


    void cheat_analysis()
    {
        cycle_t c = get_cycles();
        if (c - tick_start >= WORK) {
            nanosleep(&zero, 0);
            tick_start = get_cycles();
        }
    }

Figure 11: The injected cheat “analysis” routine. (The WORK macro
expands to the number of cycles that reflect the desired cheat
fraction; the tick_start global variable is initialized
beforehand to hold the beginning of the tick in which the
application was started.)


The routine we used is listed in Fig. 11. Invoking it 
often enough would turn any application into a cheater. 
The question is where exactly to inject this code, and 
what is the penalty in terms of performance. The answer 
to both questions is obviously dependent on the instru- 
mented application. For the purpose of this evaluation, 
we chose to experiment with an event-driven simulator of 
a supercomputer scheduler we use as the basis of many 
research efforts [39, 19, 18, 51]. Aside from initially reading
an input log file (containing a few years' worth of parallel
jobs’ arrival times, runtimes, etc.), the simulator is
strictly CPU intensive. The initial part is not included in 
our measurements, so as not to amortize the instrumen- 
tation overhead and overshadow its real cost by hiding it 
within more expensive I/O operations. 


Fig. 12 shows the slowdown that the simulator experienced
as a function of the granularity of the injection. In all cases
the granularity was fine enough to turn the simulator into a
full-fledged cheater. Instrumenting every machine instruction
in the program incurs a slowdown factor of 123, which makes sense
because this is approximately the duration of cheat_analysis in
cycles. This is largely dominated by the rdtsc operation
(read time-stamp counter; wrapped by get_cycles), which takes
about 90 cycles. The next grain size is a basic block, namely,
a single-entry single-exit instruction sequence (containing no
branches). In accordance with the well-known assertion that
“the average basic block size is around four to five
instructions” [56], it incurs a slowdown of 25, which is indeed
a fifth of the slowdown associated with instructions. A trace of
instructions (associated with the hardware trace-cache) is
defined to be a single-entry multiple-exit sequence of basic
blocks that may be separated spatially, but are adjacent
temporally [43]. This grain size further reduces the slowdown
to 15. Instrumenting at the coarser function level brings
us to a slowdown factor of 3.6, which is unfortunately
still far from optimal.
Figure 12: The overheads incurred by cheat-instrumentation,
as a function of the granularity at which the cheat_analysis routine
is injected. The Y axis denotes the slowdown penalty due
to the instrumentation, relative to the runtime obtained when
no instrumentation takes place (“none”).


The average number of instructions per function within a
simulator run is very small (about 35), the result of multiple
abstraction layers within the critical path of execution.
This makes the function granularity inappropriate for injecting
the cheat code into our simulator, when attempting to turn it
into an efficient cheater. Furthermore, considering the fact
that nowadays a single tick consists of millions of cycles
(about 11 million on the platform we used, namely, a 2.8 GHz
Pentium-IV at 250 Hz tick rate), a more adequate grain size for
the purpose of cheating would be, say, tens of thousands of
cycles. Thus, a slightly more sophisticated approach is
required. Luckily, simple execution profiling (using Pin or
numerous other tools) quickly reveals where an application
spends most of its time; in the case of our simulator this was
within two functions, the one that pops the next event to
simulate and the one that searches for the next parallel
job to schedule. By instructing Pin to instrument only
these two functions, we were able to turn the simulator
into a cheater, while reducing the slowdown penalty to
less than 2% of the baseline. We remark that even though
this selective instrumentation process required our manual
intervention, we believe it reflects a fairly straightforward
and simple methodology that can probably be automated with
some additional effort.
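
The reported slowdowns are roughly consistent with a back-of-the-envelope estimate, sketched below. The model is an assumption of this illustration and not a measurement: it takes the routine to cost about 123 cycles per invocation, assumes about one cycle per original instruction, and ignores the original work itself (as the text does when it equates the per-instruction slowdown with the routine's duration). The function name estimated_slowdown is hypothetical.

```c
#include <assert.h>

/* Rough slowdown estimate when a routine costing routine_cycles is
 * invoked once per unit of instrs_per_unit original instructions. */
double estimated_slowdown(double routine_cycles, double instrs_per_unit)
{
    assert(routine_cycles >= 0.0 && instrs_per_unit > 0.0);
    return routine_cycles / instrs_per_unit;
}
```

With this estimate, one invocation per instruction gives 123, one per five-instruction basic block gives about 24.6 (reported: 25), and one per 35-instruction function gives about 3.5 (reported: 3.6). By the same reasoning, invoking the routine once per tens of thousands of cycles would make its cost negligible, which is what instrumenting only selected functions aims to approximate.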


Finally, note that all slowdowns were computed with 
respect to the runtime of a simulator that was not instru- 
mented, but still executed under Pin. This was done so as 
not to pollute our evaluation with unrelated Pin-specific 
performance issues. Indeed, running the simulator na- 
tively is 45% faster than the Pin baseline, a result of Pin 
essentially being a virtual machine [33]. Other binary 
instrumentation methods, whether static [48], exploiting 
free space within the executable itself [41], or linking to 
it loadable modules [4], do not suffer from this deficiency 
and are expected to close the gap. 
Figure 13: Throughput of an 80%-cheater competing against
an honest process, under the operating systems with which we
experimented. These measurements were executed on the following
OS versions: Linux 2.4.32, Linux 2.6.16, Windows XP
SP2, Solaris 10 (SunOS 5.10 for i386), and FreeBSD 6.1.


5 General-Purpose Schedulers 


The results shown so far are associated with Linux- 
2.6. To generalize, we experimented with other tick- 
ing operating systems and found that they are all sus- 
ceptible to the cheat attack. The attack implementation 
was usually as shown in Fig. 3, possibly replacing the 
nanosleep with a call to pause, which blocks on a repeated
tick-resolution alarm signal that was set beforehand
using setitimer (all functions are standard POSIX;
the exception was Windows XP, for which we used
Win32’s GetMessage for blocking). Fig. 13 shows the
outcome of repeating the experiment described in Section 3
(throughput of simultaneously running an 80%-cheater
against an honest process) under the various OSs.

Our high level findings were already detailed in Sec- 
tion 1.4 and summarized in Fig. 1. In this section we 
describe in more detail the design features that make 
schedulers vulnerable to cheating; importantly, we ad- 
dress the “partial quantum decay” mechanism of Win- 
dows XP and provide more details regarding FreeBSD, 
which separates billing from timing activity and requires 
a more sophisticated cheating approach. 


5.1 Multilevel Feedback Queues 


Scheduling in all contemporary general-purpose operat- 
ing systems is based on a multilevel feedback queue. The 
details vary, but roughly speaking, the priority is a com- 
bination of a static component (“nice” value), and a dy- 
namic component that mostly reflects lack of CPU us- 
age, interpreted as being “I/O bound”; processes with 
the highest priority are executed in a round-robin man- 
ner. As the cheat process avoids billing, it gets the high- 
est priority and hence can monopolize the CPU. This is 
what makes cheating widely applicable. 
A potential problem with the multilevel feedback
queues is that processes with lower priorities might 
starve. OSs employ various policies to avoid this. For ex- 
ample, Linux 2.4 uses the notion of “epochs” [36]. Upon 
a new epoch, the scheduler allocates a new quantum to 
all processes, namely, allows them to run for an addi- 
tional 60ms. The epoch will not end until all runnable 
processes have exhausted their allocation, ensuring all
of them get a chance to run before new allocations are 
granted. Epochs are initiated by the tick handler, as part 
of the third item in Section 1.1. The remaining time a 
process has to run in the current epoch also serves as its 
priority (higher values imply higher priority). Schedul- 
ing decisions are made only when the remaining alloca- 
tion of the currently running process reaches zero (pos- 
sibly resulting in a new epoch if no runnable processes 
with positive allocation exist), or when a blocked process 
is made runnable. 

This design would initially seem to place a limit on the 
fraction of cycles hijacked by the cheater. However, as 
always, cheating works because of the manner Linux-2.4 
rewards sleepers: upon a new epoch, a currently blocked 
process gets to keep half of its unused allocation, in ad- 
dition to the default 60ms. As a cheater is never billed 
and always appears blocked when a tick takes place, its 
priority quickly becomes $\sum_{i=0}^{\infty} 60 \cdot 2^{-i} = 120$ (the maximum
possible), which means it is always selected to run
when it unblocks. 
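
The convergence to 120 follows from applying the epoch rule repeatedly, as the following sketch shows. It is a worked illustration of the rule as described above (keep half of the unused allocation, add the default 60ms), not code from the Linux scheduler; the function name sleeper_allocation_ms and the use of floating point are assumptions of this example.

```c
#include <assert.h>

/* Allocation (ms) of a process that is blocked at every epoch boundary
 * and is never billed, after `epochs` new epochs: at each epoch it
 * keeps half of its unused allocation and receives the default quantum
 * on top of that. */
double sleeper_allocation_ms(double default_ms, int epochs)
{
    assert(epochs >= 0);
    double alloc = 0.0;
    for (int i = 0; i < epochs; i++)
        alloc = alloc / 2.0 + default_ms;
    return alloc;
}
```

Starting from nothing, the allocation is 60 after one epoch, 90 after two, and 105 after three; it approaches, but never exceeds, twice the default quantum, which is the sum of the geometric series given in the text.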

In Solaris, the relationship between the priority and 
the allocated quantum goes the other way [35]. When a 
thread is inserted to the run-queue, a table is used to al- 
locate its new priority and quantum (which are two sep- 
arate things here) based on its previous priority and the 
reason it is inserted into the queue — either because its 
time quantum expired or because it just woke up after 
blocking. The table is designed such that processes that 
consume their entire allocation receive even longer al- 
locations, but are assigned lower priorities. In contrast, 
threads that block or sleep are allocated higher priorities,
but shorter quanta. By avoiding billing, the cheater
is considered a chronic sleeper that never runs, causing 
its priority to increase until it reaches the topmost prior- 
ity available. The short-quanta allocation restriction is 
circumvented, because the scheduler maintains its own 
(misguided) CPU-accounting based on sampling ticks. 


5.2 Prioritization For Interactivity 


An obvious feature of the Linux 2.4 and Solaris schemes 
is that modern interactive processes (such as games or movie
players that consume a lot of CPU cycles) will end up 
having low priority and will be delayed as they wait for 
all other processes to complete their allocations. This is 
an inherent feature of trying to equally partition the processor
between competing processes. Linux 2.6 therefore
attempts to provide special treatment to processes it
identifies as interactive by maintaining their priority high 
and by allowing them to continue to execute even if their 
allocation runs out, provided other non-interactive pro- 
cesses weren’t starved for too long [32]. A similar mech- 
anism is used in the ULE scheduler on FreeBSD [42]. 

In both systems, interactive processes are identified 
based on the ratio between the time they sleep and the 
time they run, with some weighting of their relative influ- 
ence. If the ratio passes a certain threshold, the process is 
deemed interactive. This mechanism plays straight into 
the hands of the cheater process: as it consistently ap- 
pears sleeping, it is classified interactive regardless of 
the specific value of the ratio. The anti-starvation mech- 
anism is irrelevant because other processes are allowed 
to run at the end of each tick when the cheater sleeps. 
Thus, cheating would have been applicable even in the 
face of completely accurate CPU accounting. (The same 
observation holds for Windows XP, as described in the 
next subsection.) 

We contend that the interactivity weakness manifested 
by the above is the result of two difficulties. The first is 
how to identify multimedia applications, which is cur- 
rently largely based on their observed sleep and CPU 
consumption patterns. This is a problematic approach: 
our cheater is an example of a “false positive” it might 
yield. In a related work we show that this problem is 
inherent, namely, that typical CPU-intensive interactive 
applications can be paired with non-interactive applications
that have identical CPU consumption patterns [14].
Thus, it is probably impossible to differentiate between 
multimedia applications and others based on such crite- 
ria, and attempts to do so are doomed to fail; we argue 
that the solution lies in directly monitoring how applica- 
tions interact with devices that are of interest to human 
users [16]. 

The second difficulty is how to schedule a process 
identified as being interactive. The problem here arises 
from the fact that multimedia applications often have 
both realtime requirements (of meeting deadlines) and 
significant computational needs. Such characteristics are 
incompatible with the negative feedback of “running re- 
duces priority to run more”, which forms the basis of 
the classic general-purpose scheduling [5] (as in Linux 
2.4, Solaris, and FreeBSD/4BSD) and only delivers fast 
response times to applications that require little CPU. 
Linux-2.6, FreeBSD/ULE, and Windows XP tackled this 
problem in a way that compromises the system. And 
while it is possible to patch this to a certain extent, we 
argue that any significant divergence from the aforemen- 
tioned negative-feedback design necessitates a much bet- 
ter notion of what is important to users than can ever be 
inferred solely from CPU consumption patterns [14, 16]. 







16th USENIX Security Symposium 


USENIX Association 


5.3 Partial Quantum Decay 


In Windows XP, the default time quantum on a work- 
station or server is 2 or 12 timer ticks, respectively, with 
the quantum itself having a value of “6” (3 x 2) or “36” 
(3 x 12), implying that every clock tick decrements the 
quantum by 3 units [44]. The reason a quantum is stored 
internally in terms of a multiple of 3 units per tick rather 
than as single units is to allow for “partial quantum de- 
cay”. Specifically, each waiting thread is charged one 
unit upon wakeup, so as to prevent situations in which 
a thread avoids billing just because it was asleep when 
the tick occurred. Hence, the cheater loses a unit upon 
each tick. Nonetheless, this loss is negligible
compared to what it gains due to its many sleep events.
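
The unit arithmetic of partial quantum decay can be made concrete with a short sketch. It only restates the numbers given above (3 units per tick, 1 unit charged per wakeup) and deliberately ignores boosts and quantum resets; the names UNITS_PER_TICK, WAKEUP_CHARGE, quantum_units and charges_to_exhaust are hypothetical and do not correspond to Windows kernel identifiers.

```c
#include <assert.h>

enum { UNITS_PER_TICK = 3, WAKEUP_CHARGE = 1 };

/* Internal quantum value for a quantum that spans `ticks` timer ticks. */
int quantum_units(int ticks)
{
    assert(ticks >= 0);
    return ticks * UNITS_PER_TICK;
}

/* Number of charges of `charge` units needed to use up the quantum. */
int charges_to_exhaust(int units, int charge)
{
    assert(units >= 0 && charge > 0);
    return (units + charge - 1) / charge;   /* round up */
}
```

A workstation quantum of 2 ticks is stored as 6 units. A thread that is running at every tick exhausts it after charges_to_exhaust(6, UNITS_PER_TICK) = 2 ticks, whereas a thread that is only charged the wakeup unit would need charges_to_exhaust(6, WAKEUP_CHARGE) = 6 wakeups, even before any quantum reset is taken into account.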

After a nonzero wait period (regardless of how short), 
Windows XP grants the awakened thread a “priority 
boost” by moving it a few steps up within the multi- 
level feedback queue hierarchy, relative to its base prior- 
ity. Generally, following a boost, threads are allowed to 
exhaust their full quantum, after which they are demoted 
one queue in the hierarchy, allocated another quantum, 
and so forth until they reach their base priority again. 
This is sufficient to allow cheating, because a cheater is 
promoted immediately after being demoted (as it sleeps 
on every tick). Thus, it consistently maintains a higher 
position relative to the “non-boosted” threads and there- 
fore always gets the CPU when it awakes. By still allow- 
ing others to run at the end of each tick, it prevents the 
anti-starvation mechanism from kicking in. 

Note that this is true regardless of whether the billing 
is accurate or not, which means XP suffers from the same
interactivity weakness as Linux 2.6 and FreeBSD/ULE.
To make things even worse, “in the case where a wait 
is not satisfied immediately” (as for cheaters), “its [the 
thread’s] quantum is reset to a full turn” [44], rendering 
the partial quantum decay mechanism (as any hypotheti- 
cal future accurate billing) completely useless. 


5.4 Dual Clocks 


Compromising FreeBSD, when configured to use its 
4BSD default scheduler [37], required us to revisit the 
code given in Fig. 3. Noticing that timer-oriented appli- 
cations often tend to synchronize with ticks and start to 
run immediately after they occur, the FreeBSD design- 
ers decided to separate the billing from the timing activ- 
ity [26]. Specifically, FreeBSD uses two timers with rel- 
atively prime frequencies — one for interrupts in charge 
of driving regular kernel timing events (with frequency 
HZ), and one for gathering billing statistics (with fre- 
quency STATHZ). A running thread’s time quantum is 
decremented by 1 every STATHZ tick. The test system
that was available to us runs with a HZ frequency of
1000Hz and a STATHZ frequency of ~133Hz.

[Figure 14 diagram omitted: timelines of timer interrupts, HZ ticks, and STATHZ ticks for cases 1, 2, and 3.]

Figure 14: The three possible alignments of the two FreeBSD
clocks: no STATHZ tick between consecutive HZ ticks
(case 1), a STATHZ tick falling on an even timer interrupt
alongside a HZ tick (case 2), and a STATHZ tick falling on an odd
clock interrupt between HZ ticks (case 3).

Both the HZ and STATHZ timers are derived from a 
single timer interrupt, configured to fire at a higher fre- 
quency of 2 x HZ = 2000Hz. During each timer inter- 
rupt the handler checks whether the HZ and/or STATHZ 
tick handlers should be called: the first is called every
2 interrupts, whereas the second is called every 15-16
interrupts (≈ 2000/133). The possible alignments of the
two are shown in Fig. 14. The HZ ticks are executed 
on each even timer interrupt (case 1). Occasionally the 
HZ and STATHZ ticks align on an even timer interrupt 
(case 2), and sometimes STATHZ is executed on an odd 
timer interrupt (case 3). By avoiding HZ ticks we also 
avoid STATHZ ticks in case 2. But to completely avoid 
being billed for the CPU time it consumes, the cheater 
must identify when case 3 occurs and sleep between the 
two consecutive HZ ticks surrounding the STATHZ tick.

The kernel’s timer interrupt handler calculates when
to call the HZ and STATHZ ticks in a manner that realigns
the two every second. Based on this, we modified
the code in Fig. 3 to pre-compute a STATHZ bitmap of
2 × HZ bits, in which each bit corresponds to a specific
timer interrupt in a one-second interval and is set
for those interrupts that drive a STATHZ tick.
Further, the code reads the number of timer interrupts
that have occurred since the system was started, which is
available through a sysctl call. The cheater then asks the
system to deliver signals at a constant HZ rate. The signal
handler in turn indexes the STATHZ bitmap with
(interrupt_index + 1) mod (2 × HZ) to check whether
the next timer interrupt will trigger a STATHZ tick. This
mechanism allows the cheater thread to identify case 3
and simply sleep until the next HZ tick fires. The need
to occasionally sleep for two full clock ticks slightly reduces
the achievable throughput, as indicated in Fig. 13.
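
The bitmap lookup described above can be sketched as follows. This is a minimal Python model and not the code we ran (which extends the C program of Fig. 3): the rule used here to decide which interrupts carry a STATHZ tick, an accumulator that restarts every second, is an assumption standing in for the FreeBSD handler's actual arithmetic, and all function names are hypothetical.

```python
HZ = 1000                 # timing-clock frequency of the test system
STATHZ = 133              # billing-clock frequency (approximate)
INTR_PER_SEC = 2 * HZ     # the single hardware timer fires at 2 x HZ


def build_stathz_bitmap():
    """Mark which of the 2 x HZ interrupts in one second drive a STATHZ tick.

    Assumption: STATHZ ticks are spread evenly over the second by an
    accumulator that restarts every second; the real kernel may differ.
    """
    bitmap = [False] * INTR_PER_SEC
    acc = 0
    for i in range(INTR_PER_SEC):
        acc += STATHZ
        if acc >= INTR_PER_SEC:
            acc -= INTR_PER_SEC
            bitmap[i] = True
    return bitmap


def must_sleep_through(bitmap, interrupt_index):
    """Called from the HZ-rate signal handler, right after an (even) HZ tick.

    Returns True if the next timer interrupt triggers a STATHZ tick,
    i.e. case 3, in which the cheater sleeps until the next HZ tick.
    """
    return bitmap[(interrupt_index + 1) % INTR_PER_SEC]
```

Under this assumed rule the bitmap holds 133 set bits per second, spaced 15 or 16 interrupts apart, which matches the 15-16 interrupt spacing mentioned above.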
6 Protecting Against the Cheat Attack


6.1 Degrees of Accuracy 


While all ticking OSs use information that is based exclusively
on sampling for the purpose of scheduling,
some operating systems (namely, Solaris and Windows XP)
also maintain precise CPU-usage information.
Under this design, each kernel entry/exit is accompanied 
by reading the cycle counter to make the kernel aware 
of how many cycles were consumed by the process thus 
far, as well as to provide the user/kernel usage statistics. 
(Incidentally this also applies to the one-shot Mac OS 
X.) Solaris provides even finer statistics by saving the
time a thread spends in each of the thread states (running, 
blocked, etc.). While such consistently accurate informa- 
tion can indeed be invaluable in various contexts, it does 
not come without a price. 

Consider, for example, the per-system-call penalty.
Maintaining user/kernel statistics requires that (at least)
the following be added to the system call invocation
path: two rdtsc operations (reading the cycle
counter at kernel entry and exit), subtraction of the associated
values, and addition of the difference to some accumulator.
On our Pentium-IV 2.8GHz this takes ~200
cycles (as each rdtsc operation takes ~90 cycles and
the arithmetic involves 64-bit integers on a 32-bit machine).
This penalty is significant relative to the duration
of short system calls: on our system, sigprocmask
takes ~430/1020 cycles with an invalid/valid argument,
respectively, implying 20-47% of added overhead.
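
The 20-47% range follows directly from the cycle counts quoted above. The short Python sketch below only reproduces that arithmetic; the function name is ours and purely illustrative.

```python
RDTSC_PENALTY = 200   # cycles added per system call (two rdtsc reads plus 64-bit arithmetic)


def added_overhead(syscall_cycles, penalty=RDTSC_PENALTY):
    """Relative slowdown of a system call that originally takes syscall_cycles."""
    return penalty / syscall_cycles


for label, cycles in (("invalid argument", 430), ("valid argument", 1020)):
    print(f"sigprocmask, {label}: {added_overhead(cycles):.0%} added overhead")
```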

Stating a similar case, Liedtke argued against this type 
of kernel fine-grained accounting [30], and indeed the as- 
sociated overheads may very well be the reason why sys- 
tems like Linux and FreeBSD do not provide such a ser- 
vice. It is not our intent to express an opinion on the mat- 
ter, but rather, to make the tradeoff explicit and to high- 
light the fact that designers need not face it when protect- 
ing against cheat attacks. Specifically, there is no need to 
know exactly how many cycles were consumed by a run- 
ning process upon each kernel entry (and user/kernel or 
finer statistics are obviously irrelevant too). The sched- 
uler would be perfectly happy with a much lazier ap- 
proach: that the information would be updated only upon 
a context switch. This is a (1) far less frequent and a (2) 
far more expensive event in comparison to a system call 
invocation, and therefore the added overhead of reading 
the cycle counter is made relatively negligible. 


6.2 Patching the Kernel 


We implemented this “lazy” perfect-billing patch within
the Linux 2.6.16 kernel. It is only a few dozen lines
long. The main modification is in the task_struct structure,
where we replace the time_slice field that counts down
a process’ CPU allocation in a resolution of “jiffies”
(the Linux term for clock ticks). It is replaced by
two fields: ns_time_slice, which counts down the allocated
time slice in nanoseconds instead of jiffies, and
ns_last_update, which records when ns_time_slice was
last updated. The value of ns_time_slice is decremented
by the time elapsed since ns_last_update in two places:
on each clock tick (this simply replaces the original
time_slice jiffy decrement, but with the improvement
of only accounting for cycles actually used by this process),
and from within the schedule function just before
a context switch (this is the new part). The rest of the
kernel is unmodified and still works in a resolution of
jiffies. This was done by replacing accesses to time_slice
with an inlined function that wraps ns_time_slice and
rounds it to jiffies.
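
The bookkeeping can be illustrated with a small Python model. It is a sketch of the idea rather than the kernel patch itself: only the two field names come from the text, while the helper names, the 1 ms jiffy, the choice to round up, and the reset of ns_last_update when a task is switched in are assumptions made for the example.

```python
NS_PER_JIFFY = 1_000_000      # assumption: HZ = 1000, so one jiffy is 1 ms


class Task:
    def __init__(self, slice_ns, now_ns):
        self.ns_time_slice = slice_ns     # remaining allocation, in nanoseconds
        self.ns_last_update = now_ns      # when ns_time_slice was last charged


def charge(task, now_ns):
    """Bill the running task for the time it used since the last update.

    Modeled call sites: the tick handler, and schedule() just before a
    context switch.
    """
    task.ns_time_slice -= now_ns - task.ns_last_update
    task.ns_last_update = now_ns


def resume(task, now_ns):
    """Assumed helper: restart the billing interval when the task is switched in."""
    task.ns_last_update = now_ns


def time_slice_jiffies(task):
    """Jiffy-resolution view for the rest of the kernel (rounds up, never negative)."""
    return max(0, -(-task.ns_time_slice // NS_PER_JIFFY))
```

With this model, a cheater that runs for 0.8 ms and blocks just before every tick is still charged 0.8 ms per interval, because the charge happens at the context switch and not only at the tick.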


Somewhat surprisingly, using this patch did not solve 
the cheat problem: a cheat process that was trying to ob- 
tain 80% of the cycles still managed to get them, despite 
the fact that the scheduler had full information about this 
(Fig. 15). As explained in Section 5, this happened be- 
cause of the extra support for “interactive” processes in- 
troduced in the 2.6 kernel. The kernel identifies pro- 
cesses that yield a lot as interactive, provided their nice 
level is not too high. When an “interactive” process exhausts
its allocation and should be moved from the “active
array” into the “expired array”, it is nevertheless allowed
to remain in the active array, as long as already expired
processes are not starved (they’re not: the cheater
runs less than 100% of the time by definition, and thus 
the Linux anti-starvation mechanism is useless against 
it). In effect, the scheduler is overriding its own quanta 
allocations; this is a good demonstration of the two sides 
of cheating prevention: it is not enough to have good in- 
formation — it is also necessary to use it effectively. 


Figure 15: In Linux 2.6, cheating is still possible even with perfect billing (compare with Figs. 4-5).

In order to rectify the situation, we prevented processes
from circumventing their allocation by commenting out
the line that reinserts expired “interactive” processes into
the active array. As shown in Fig. 16, this finally
defeated the attack. The timeline is effectively
divided into epochs of 200ms (corresponding to
the combined duration of the two 100ms time-slices of
the two competing processes) in which the processes
share the CPU equitably. While the “interactive” cheater
has higher priority (as its many block events gain it a
higher position in the multilevel queue hierarchy), this is
limited to the initial part of the epoch, where the cheater
repeatedly gets to run first upon each tick. However, after
~125ms of which the cheater consumes 80%, its allocation
runs out (125ms × 80% = 100ms). It is then moved to
the expired array and its preferential treatment is temporarily
disabled. The honest process is now allowed to
catch up and indeed runs for ~75ms until it too exhausts
its quantum and is removed from the active array, leaving
it empty. At this point the expired and active arrays are
swapped and the cycle repeats.
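
The epoch arithmetic can be checked with a few lines of Python. The variable names are illustrative, and the model assumes exactly two competing processes with 100ms time-slices, as in the experiment described above.

```python
TIME_SLICE_MS = 100        # per-process allocation
CHEATER_SHARE_PCT = 80     # percentage of each tick interval the cheater consumes

# Phase 1: the cheater runs first after every tick until its slice is used up.
phase1_ms = TIME_SLICE_MS * 100 / CHEATER_SHARE_PCT
honest_in_phase1_ms = phase1_ms - TIME_SLICE_MS

# Phase 2: the cheater sits in the expired array while the honest process
# consumes the rest of its own slice.
phase2_ms = TIME_SLICE_MS - honest_in_phase1_ms

epoch_ms = phase1_ms + phase2_ms
print(phase1_ms, honest_in_phase1_ms, phase2_ms, epoch_ms)
```

Each process ends up with 100ms out of every 200ms epoch, which is the equitable split reported above.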

The above exposes a basic tradeoff inherent to prior- 
itizing based on CPU consumption patterns: one must 
either enforce a relatively equitable distribution of CPU 
cycles, or be exposed to attacks by cheaters that can eas- 
ily emulate “interactive” behavior. (We note in passing 
that processes with +19 nice value are never regarded as 
interactive by the 2.6 kernel, so the “optimization” that 
allows interactive processes to deprive the others is ef- 
fectively disabled; see right table in Fig. 15.) 

Finally, let us discuss the patch overheads. The
schedule function was ~80 cycles (≈5%) slower:
1636 ± 182 cycles on average instead of 1557 ± 159
without the patch (± denotes standard deviation). At the
same time, the overhead of a tick handler (the scheduler_tick
function) was reduced by 17%, from 8439 ±
9323 to 6971 ± 9506 cycles. This is probably due to the fact
that after the patch, the cheater ran much less, and therefore
generated far fewer timers for the handler to process.
Note that these measurements embody the direct overhead
only (they do not include traps to the kernel and back,
nor cache pollution due to the traps or context switches).
Also note that, as the high standard deviations indicate,
the distribution of tick durations has a long tail, with maximal
values around 150,000 cycles. Lastly, the patch did not
affect the combined throughput of the processes at all.

















6.3 Other Potential Solutions 


Several solutions may be used to prevent cheating appli- 
cations from obtaining excessive CPU resources. Here 
we detail some of them, and explain why they are infe- 
rior to the accurate billing we suggested above. Perhaps 
the simplest solution is to charge for CPU usage up-front, 
when a process is scheduled to run, rather than relying 
on sampling of the running process. However, this will 
overcharge interactive processes that in fact do not use 
much CPU time. Another potential solution is to use 
two clocks, but have the billing clock operate at a finer 
resolution than the timer clock. This leads to two problems.
One is that it requires a very high tick rate, which
leads to excessive overhead. The other is that it does not 
completely eliminate the cheat attack. An attack is still 
possible using an extension of the cheat server approach 
described in Section 4. The extension is that the server 
is used not only to stop execution, but also to start it. A 
variant of this is to randomize the clock in order to make 
it impossible for an attacker to predict when ticks will 
occur as suggested by Liedtke in relation to user/kernel 
statistics [30]. This can work, but at the cost of overheads 
and complexity. Note however that true randomness is 
hard to come by, and it has already been shown that 
a system’s random number generator could be reverse- 
engineered in order to beat the randomness [24]. A third 
possible approach is to block access to the cycle counter 
from user level (this is possible at least on the Intel ma- 
chines). This again suffers from two problems. First, it 
withdraws a service that may have good and legitimate 
uses. Second, it too does not eliminate the cheat attack, 
only makes it somewhat less accurate. A cheat application
can still be written without access to a cycle counter by 
finding approximately how much application work can 
be done between ticks, and using this directly to decide 
when to stop running. 


6.4 A Note About Sampling 


In the system domain, it is often tempting to say “let us 
do this chore periodically”. It is simple and easy and 
therefore often the right thing to do. But if the chore is 
somehow related to accounting or safeguarding a system, 
and if “periodically” translates to “can be anticipated”, 
then the design might be vulnerable. This observation 
is hardly groundbreaking. However, as with ticks, we 
suspect it is often brushed aside for the sake of simplicity. 
Without any proof, we now list a few systems that may 
possess this vulnerability.

At a finer granularity than ticks, one can find Cisco’s
NetFlow router tool that “performs 1 in N periodic [non-probabilistic]
sampling” [13] (possibly allowing an adversary
to avoid paying for his traffic). At a coarser granularity
is the per-node infod of the MOSIX cluster
infrastructure [7], which wakes up every 5 seconds to
charge processes that migrated to the node (work can be
partitioned into shorter processes). The FAQ of IBM’s internal
file infrastructure called GSA (Global Storage Architecture)
states that “charges will be based on daily file
space snapshots” [22] (raising the possibility of a well-timed
mv between two malicious cooperating users).
And finally, the US Army MDARS (Mobile Detection
Assessment Response System) patrol robots that “stop
periodically during their patrols to scan for intruders using
radar and infrared sensors” in search of moving objects
[45] again raise the question of what exactly
“periodically” means.

Figure 16: Cheating is eliminated when expired processes are not reinserted to the active list (compare with Fig. 15).


7 Conclusions 


The “cheat” attack is a simple way to exploit computer 
systems. It allows an unprivileged user-level application 
to seize whatever fraction of the CPU cycles it wants, 
often in a secretive manner. The cycles used by the 
cheater are attributed to some other innocent applica- 
tion or simply unaccounted for, making the attack hard 
to detect. Such capabilities are typically associated with 
rootkits that, in contrast, require an attacker to obtain su- 
peruser privileges. We have shown that all major general- 
purpose systems are vulnerable to the attack, with the 
exception of Mac OS X that utilizes one-shot timers to 
drive its timing mechanism. 

Cheating is based on two dominant features of 
general-purpose systems: that CPU accounting and timer 
servicing are tied to periodic hardware clock interrupts, 
and that the scheduler favors processes that exhibit low 
CPU usage. By systematically sleeping when the inter- 
rupts occur, a cheater appears as not consuming CPU 
and is therefore rewarded with a consistent high priority, 
which allows it to monopolize the processor. 

The first step to protect against the cheat attack is to 
maintain accurate CPU usage information. This is al- 


CPU usage before a context switch occurs, we achieve 
sufficient accuracy in a manner more suitable for systems 
like Linux and FreeBSD that are unwilling to pay the as- 
sociated overhead of the Solaris/Windows way. Once the 
information is available, the second part of the solution 
is to incorporate it within the scheduling subsystem (So- 
laris and XP don’t do that). 

The third component is to use the information judi- 
ciously. This is not an easy task, as indicated by the fail- 
ure of Windows XP, Linux 2.6, and FreeBSD/ULE to do 
so, allowing a cheater to monopolize the CPU regardless 
of whether accurate information is used for scheduling 
or not. In an attempt to better support the ever increasing 
CPU-intensive multimedia component within the desk- 
top workload, these systems have shifted to prioritizing 
processes based on their sleep-events frequency, instead 
of duration. This major departure from the traditional 
general-purpose scheduler design [5] plays straight into 
the hands of cheaters, which can easily emulate CPU- 
usage patterns that multimedia applications exhibit. A 
safer alternative would be to explicitly track user interac- 
tions [14, 16]. 
Abstract


We are entering the multi-core era in computer science. 
All major high-performance processor manufacturers have in- 
tegrated at least two cores (processors) on the same chip — 
and it is predicted that chips with many more cores will be- 
come widespread in the near future. As cores on the same chip 
share the DRAM memory system, multiple programs executing
on different cores can interfere with each other’s memory
access requests, thereby adversely affecting one another’s per- 
formance. 

In this paper, we demonstrate that current multi-core proces- 
sors are vulnerable to a new class of Denial of Service (DoS) 
attacks because the memory system is “unfairly” shared among 
multiple cores. An application can maliciously destroy the 
memory-related performance of another application running on 
the same chip. We call such an application a memory perfor- 
mance hog (MPH). With the widespread deployment of multi- 
core systems in commodity desktop and laptop computers, we 
expect MPHs to become a prevalent security issue that could 
affect almost all computer users. 

We show that an MPH can reduce the performance of an- 
other application by 2.9 times in an existing dual-core system, 
without being significantly slowed down itself; and this prob- 
lem will become more severe as more cores are integrated on 
the same chip. Our analysis identifies the root causes of unfair- 
ness in the design of the memory system that make multi-core 
processors vulnerable to MPHs. As a solution to mitigate the 
performance impact of MPHs, we propose a new memory sys- 
tem architecture that provides fairness to different applications 
running on the same chip. Our evaluations show that this mem- 
ory system architecture is able to effectively contain the neg- 
ative performance impact of MPHs in not only dual-core but 
also 4-core and 8-core systems. 


1 Introduction 


For many decades, the performance of processors has in- 
creased by hardware enhancements (increases in clock 
frequency and smarter structures) that improved single-thread
(sequential) performance. In recent years, however,
the immense complexity of processors as well
as limits on power consumption have made it increasingly
difficult to further enhance single-thread performance
[18]. For this reason, there has been a paradigm
shift away from implementing such additional enhancements.
Instead, processor manufacturers have moved on
to integrating multiple processors on the same chip in 
a tiled fashion to increase system performance power- 
efficiently. In a multi-core chip, different applications 
can be executed on different processing cores concur- 
rently, thereby improving overall system throughput 
(with the hope that the execution of an application on 
one core does not interfere with an application on an- 
other core). Current high-performance general-purpose 
computers have at least two processors on the same chip 
(e.g. Intel Pentium D and Core Duo (2 processors), Intel 
Core-2 Quad (4), Intel Montecito (2), AMD Opteron (2), 
Sun Niagara (8), IBM Power 4/5 (2)). And, the industry 
trend is toward integrating many more cores on the same 
chip. In fact, Intel has announced experimental designs 
with up to 80 cores on chip [16]. 

The arrival of multi-core architectures creates signif- 
icant challenges in the fields of computer architecture, 
software engineering for parallelizing applications, and 
operating systems. In this paper, we show that there are 
important challenges beyond these areas. In particular, 
we expose a new security problem that arises due to the 
design of multi-core architectures — a Denial-of-Service 
(DoS) attack that was not possible in a traditional single-threaded
processor.¹ We identify the “security holes”
in the hardware design of multi-core systems that make 
such attacks possible and propose a solution that miti- 
gates the problem. 

In a multi-core chip, the DRAM memory system is
shared among the threads concurrently executing on different
processing cores. The way current DRAM memory
systems work, it is possible that a thread with a
particular memory access pattern can occupy shared resources
in the memory system, preventing other threads
from using those resources efficiently. In effect, the
memory requests of some threads can be denied service
by the memory system for long periods of time. Thus,
an aggressive memory-intensive application can severely
degrade the performance of other threads with which it
is co-scheduled (often without even being significantly
slowed down itself). We call such an aggressive application
a Memory Performance Hog (MPH). For example,
we found that on an existing dual-core Intel Pentium
D system one aggressive application can slow down another
co-scheduled application by 2.9X while it suffers
a slowdown of only 18% itself. In a simulated 16-core
system, the effect is significantly worse: the same application
can slow down other co-scheduled applications
by 14.6X while it slows down by only 4.4X. This shows
that, although already severe today, the problem caused
by MPHs will become much more severe as processor
manufacturers integrate more cores on the same chip in
the future.

¹ While this problem could also exist in SMP (symmetric shared-memory
multiprocessor) and SMT (simultaneous multithreading) systems,
it will become much more prevalent in multi-core architectures,
which will be widely deployed in commodity desktop, laptop,
and server computers.

There are three discomforting aspects of this novel se- 
curity threat: 


• First, an MPH can maliciously destroy the memory-related
performance of other programs that run on
different processors on the same chip. Such Denial
of Service in a multi-core memory system can ultimately
cause significant discomfort and productivity
loss to the end user, and it can have unforeseen
consequences. For instance, an MPH (perhaps written
by a competitor organization) could be used to
fool computer users into believing that some other
applications are inherently slow, even without causing
easily observable effects on system performance
measures such as CPU usage. Or, an MPH can result
in very unfair billing procedures on grid-like computing
systems where users are charged based on
CPU hours [9].² With the widespread deployment
of multi-core systems in commodity desktop, laptop,
and server computers, we expect MPHs to become a
much more prevalent security issue that could affect
almost all computer users.

• Second, the problem of memory performance attacks
is radically different from other, known attacks on
shared resources in systems, because it cannot be
prevented in software. The operating system or the
compiler (or any other application) has no direct control
over the way memory requests are scheduled in
the DRAM memory system. For this reason, even
carefully designed and otherwise highly secured systems
are vulnerable to memory performance attacks,
unless a solution is implemented in memory system
hardware itself. For example, numerous sophisticated
software-based solutions are known to prevent
DoS and other attacks involving mobile or untrusted
code (e.g. [10, 25, 27, 5, 7]), but these are unsuited
to prevent our memory performance attacks.

• Third, while an MPH can be designed intentionally, a
regular application can unintentionally behave like an
MPH and damage the memory-related performance
of co-scheduled applications, too. This is discomforting
because an existing application that runs without
significantly affecting the performance of other
applications in a single-threaded system may deny
memory system service to co-scheduled applications
in a multi-core system. Consequently, critical applications
can experience severe performance degradations
if they are co-scheduled with a non-critical but
memory-intensive application.

² In fact, in such systems, some users might be tempted to rewrite
their programs to resemble MPHs so that they get better performance
for the price they are charged. This, in turn, would unfairly slow down
co-scheduled programs of other users and cause other users to pay
much more, since their programs would now take more CPU hours.


The fundamental reason why an MPH can deny memory 
system service to other applications lies in the “unfair- 
ness” in the design of the multi-core memory system. 
State-of-the-art DRAM memory systems service mem- 
ory requests on a First-Ready First-Come-First-Serve 
(FR-FCFS) basis to maximize memory bandwidth uti- 
lization [30, 29, 23]. This scheduling approach is suit- 
able when a single thread is accessing the memory sys- 
tem because it maximizes the utilization of memory 
bandwidth and is therefore likely to ensure fast progress 
in the single-threaded processing core. However, when 
multiple threads are accessing the memory system, ser- 
vicing the requests in an order that ignores which thread 
generated the request can unfairly delay some thread’s 
memory requests while giving unfair preference to oth- 
ers. As a consequence, the progress of an application 
running on one core can be significantly hindered by an 
application executed on another. 


In this paper, we identify the causes of unfairness in 
the DRAM memory system that can result in DoS attacks 
by MPHs. We show how MPHs can be implemented and 
quantify the performance loss of applications due to un- 
fairness in the memory system. Finally, we propose a 
new memory system design that is based on a novel def- 
inition of DRAM fairness. This design provides memory 
access fairness across different threads in multi-core sys- 
tems and thereby mitigates the impact caused by a mem- 
ory performance hog. 


The major contributions we make in this paper are: 


• We expose a new Denial of Service attack that
can significantly degrade application performance on
multi-core systems and we introduce the concept of
Memory Performance Hogs (MPHs). An MPH is an
application that can destroy the memory-related performance
of another application running on a different
processing core on the same chip.

• We demonstrate that MPHs are a real problem by
evaluating the performance impact of DoS attacks on
both real and simulated multi-core systems.

• We identify the major causes in the design of the
DRAM memory system that result in DoS attacks:
hardware algorithms that are unfair across different
threads accessing the memory system.

• We describe and evaluate a new memory system design
that provides fairness across different threads
and mitigates the large negative performance impact
of MPHs.


2 Background 


We begin by providing a brief background on multi- 
core architectures and modern DRAM memory systems. 
Throughout the section, we abstract away many details 
in order to give just enough information necessary to 
understand how the design of existing memory systems 
could lend itself to denial of service attacks by explicitly- 
malicious programs or real applications. Interested read- 
ers can find more details in [30, 8, 41]. 


2.1 Multi-Core Architectures 


Figure 1 shows the high-level architecture of a processing
system with one core (single-core), two cores (dual-core)
and N cores (N-core). In our terminology, a “core”
includes the instruction processing pipelines (integer and
floating-point), instruction execution units, and the L1
instruction and data caches. Many general-purpose com- 
puters manufactured today look like the dual-core sys- 
tem in that they have two separate but identical cores. 
In some systems (AMD Athlon/Turion/Opteron, Intel 
Pentium-D), each core has its own private L2 cache, 
while in others (Intel Core Duo, IBM Power 4/5) the L2 
cache is shared between different cores. The choice of a 
shared vs. non-shared L2 cache affects the performance 
of the system [19, 14] and a shared cache can be a pos- 
sible source of vulnerability to DoS attacks. However, 
this is not the focus of our paper because DoS attacks at 
the L2 cache level can be easily prevented by providing 
a private L2 cache to each core (as already employed by 
some current systems) or by providing “quotas” for each 
core in a shared L2 cache [28]. 

Regardless of whether or not the L2 cache is shared, 
the DRAM Memory System of current multi-core sys- 
tems is shared among all cores. In contrast to the L2 
cache, assigning a private DRAM memory system to 
each core would significantly change the programming 
model of shared-memory multiprocessing, which is com- 
monly used in commercial applications. Furthermore, 
in a multi-core system, partitioning the DRAM memory 
system across cores (while maintaining a shared-memory 
programming model) is also undesirable because: 


1. DRAM memory is still a very expensive resource 
in modern systems. Partitioning it requires more 
DRAM chips along with a separate memory con- 
troller for each core, which significantly increases the 
cost of a commodity general-purpose system, espe- 
cially in future systems that will incorporate tens of 
cores on chip. 

2. In a partitioned DRAM system, a processor access- 
ing a memory location needs to issue a request to the 
DRAM partition that contains the data for that loca- 
tion. This incurs additional latency and a communi- 
cation network to access another processor’s DRAM 
if the accessed address happens to reside in that par- 
tition. 

For these reasons, we assume in this paper that each core
has a private L2 cache but all cores share the DRAM
memory system. We now describe the design of the
DRAM memory system in state-of-the-art systems.


2.2 DRAM Memory Systems 


A DRAM memory system consists of three major com- 
ponents: (1) the DRAM banks that store the actual data, 
(2) the DRAM controller (scheduler) that schedules com- 
mands to read/write data from/to the DRAM banks, and 
(3) DRAM address/data/command buses that connect the 
DRAM banks and the DRAM controller. 


2.2.1 DRAM Banks 


A DRAM memory system is organized into multiple 
banks such that memory requests to different banks can 
be serviced in parallel. As shown in Figure 2 (left), each 
DRAM bank has a two-dimensional structure, consisting 
of multiple rows and columns. Consecutive addresses in 
memory are located in consecutive columns in the same 
row.³ The size of a row varies, but it is usually between
1-8Kbytes in commodity DRAMs. In other words, in a 
system with 32-byte L2 cache blocks, a row contains 32- 
256 L2 cache blocks. 

Each bank has one row-buffer and data can only be read 
from this buffer. The row-buffer contains at most a sin- 
gle row at any given time. Due to the existence of the 
row-buffer, modern DRAMs are not truly random access
(equal access time to all locations in the memory array). 
Instead, depending on the access pattern to a bank, a 
DRAM access can fall into one of the three following 
categories: 


1. Row hit: The access is to the row that is already in
the row-buffer. The requested column can simply
be read from or written into the row-buffer (called
a column access). This case results in the lowest
latency (typically 30-50ns round trip in commodity
DRAM, including data transfer time, which translates
into 90-150 processor cycles for a core running
at 3GHz clock frequency). Note that sequential/streaming
memory access patterns (e.g. accesses
to cache blocks A, A+1, A+2, ...) result in row hits
since the accessed cache blocks are in consecutive
columns in a row. Such requests can therefore be
handled relatively quickly.

2. Row conflict: The access is to a row different from
the one that is currently in the row-buffer. In this
case, the row in the row-buffer first needs to be written
back into the memory array (called a row-close)
because the row access had destroyed the row’s data
in the memory array. Then, a row access is performed
to load the requested row into the row-buffer.
Finally, a column access is performed. Note that this
case has much higher latency than a row hit (typically
60-100ns or 180-300 processor cycles at 3GHz).

3. Row closed: There is no row in the row-buffer. Due
to various reasons (e.g. to save energy), DRAM
memory controllers sometimes close an open row in
the row-buffer, leaving the row-buffer empty. In this
case, the required row needs to be first loaded into the
row-buffer (called a row access). Then, a column access
is performed. We mention this third case for the
sake of completeness because in the paper, we focus
primarily on row hits and row conflicts, which have
the largest impact on our results.

³ Note that consecutive memory rows are located in different banks.

Figure 1: High-level architecture of an example single-core system (left), a dual-core system (middle), and an N-core
system (right). The chip is shaded. The DRAM memory system, part of which is off chip, is encircled.

Figure 2: Left: Organization of a DRAM bank, Right: Organization of the DRAM controller


Due to the nature of DRAM bank organization, sequen- 
tial accesses to the same row in the bank have low latency 
and can be serviced at a faster rate. However, sequen- 
tial accesses to different rows in the same bank result in 
high latency. Therefore, to maximize bandwidth, current 
DRAM controllers schedule accesses to the same row in 
a bank before scheduling the accesses to a different row 
even if those were generated earlier in time. We will later 
show how this policy causes unfairness in the DRAM 
system and makes the system vulnerable to DoS attacks. 
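
The three access categories and the latency figures quoted above can be sketched as follows. This is a minimal illustration, not the controller's actual logic: the function names `classify_access` and `ns_to_cycles` are hypothetical, and the 3GHz clock is the one assumed in the text.

```python
from typing import Optional

CPU_GHZ = 3.0  # clock frequency used in the text's cycle estimates


def classify_access(open_row: Optional[int], requested_row: int) -> str:
    """Return the DRAM access category for one request to a bank."""
    if open_row is None:
        return "row closed"    # row access, then column access
    if open_row == requested_row:
        return "row hit"       # column access only
    return "row conflict"      # row-close, row access, then column access


def ns_to_cycles(latency_ns: float, ghz: float = CPU_GHZ) -> int:
    """Convert a latency in nanoseconds to processor cycles."""
    return round(latency_ns * ghz)


print(classify_access(open_row=7, requested_row=7))     # row hit
print(classify_access(open_row=7, requested_row=3))     # row conflict
print(classify_access(open_row=None, requested_row=3))  # row closed
print(ns_to_cycles(30), ns_to_cycles(50))    # 90 150  (row hit range)
print(ns_to_cycles(60), ns_to_cycles(100))   # 180 300 (row conflict range)
```

The conversion reproduces the cycle ranges given for row hits and row conflicts.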


2.2.2 DRAM Controller 


The DRAM controller is the mediator between the on- 
chip caches and the off-chip DRAM memory. It re- 
ceives read/write requests from L2 caches. The addresses 
of these requests are at the granularity of the L2 cache 
block. Figure 2 (right) shows the architecture of the 
DRAM controller. The main components of the con- 
troller are the memory request buffer and the memory ac- 
cess scheduler. 

The memory request buffer buffers the requests re- 
ceived for each bank. It consists of separate bank request 
buffers. Each entry in a bank request buffer contains the
address (row and column), the type (read or write), the 
timestamp, and the state of the request along with stor- 
age for the data associated with the request. 

The memory access scheduler is the brain of the mem- 
ory controller. Its main function is to select a memory 
request from the memory request buffer to be sent to 
DRAM memory. It has a two-level hierarchical orga- 
nization as shown in Figure 2. The first level consists of 
separate per-bank schedulers. Each bank scheduler keeps 
track of the state of the bank and selects the highest- 
priority request from its bank request buffer. The second 
level consists of an across-bank scheduler that selects the 
highest-priority request among all the requests selected 
by the bank schedulers. When a request is scheduled by 
the memory access scheduler, its state is updated in the 
bank request buffer, and it is removed from the buffer
when the request is served by the bank. (For simplicity,
these control paths are not shown in Figure 2.)


2.2.3 Memory Access Scheduling Algorithm 


Current memory access schedulers are designed to max- 
imize the bandwidth obtained from the DRAM memory. 
As shown in [30], a simple request scheduling algorithm 
that serves requests based on a first-come-first-serve pol- 
icy is prohibitive, because it incurs a large number of 
row conflicts. Instead, current memory access schedulers 
usually employ what is called a First-Ready First-Come- 
First-Serve (FR-FCFS) algorithm to select which request 
should be scheduled next [30, 23]. This algorithm prior- 
itizes requests in the following order in a bank: 


1. Row-hit-first: A bank scheduler gives higher prior- 
ity to the requests that would be serviced faster. In 
other words, a request that would result in a row hit 
is prioritized over one that would cause a row con- 
flict. 

2. Oldest-within-bank-first: A bank scheduler gives 
higher priority to the request that arrived earliest. 


Selection from the requests chosen by the bank sched- 
ulers is done as follows: 

Oldest-across-banks-first: The across-bank DRAM
bus scheduler selects the request with the earliest arrival
time among all the requests selected by individual bank
schedulers.

In summary, this algorithm strives to maximize DRAM
bandwidth by scheduling accesses that cause row hits 
first (regardless of when these requests have arrived) 
within a bank. Hence, streaming memory access patterns 
are prioritized within the memory system. The oldest 
row-hit request has the highest priority in the memory 
access scheduler. In contrast, the youngest row-conflict 
request has the lowest priority. 
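
The two-level FR-FCFS selection described above can be sketched as follows. This is a simplified model under stated assumptions: the `Request` record and the function names `bank_pick` and `fr_fcfs` are hypothetical, timing constraints are ignored, and a bank whose row-buffer is empty is treated as having no row hits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    thread: int
    bank: int
    row: int
    arrival: int  # timestamp; smaller means older


def bank_pick(queue: List[Request], open_row: Optional[int]) -> Optional[Request]:
    """Per-bank rule: row-hit-first, then oldest-within-bank-first."""
    if not queue:
        return None
    # False sorts before True, so requests to the open row win.
    return min(queue, key=lambda r: (r.row != open_row, r.arrival))


def fr_fcfs(queues: Dict[int, List[Request]],
            open_rows: Dict[int, Optional[int]]) -> Optional[Request]:
    """Across-bank rule: oldest request among the per-bank winners."""
    winners = [bank_pick(q, open_rows.get(b)) for b, q in queues.items()]
    winners = [w for w in winners if w is not None]
    return min(winners, key=lambda r: r.arrival) if winners else None


# Bank 0 has row 5 open: the younger row hit beats the older row conflict.
bank0 = [Request(thread=1, bank=0, row=9, arrival=0),
         Request(thread=0, bank=0, row=5, arrival=10)]
bank1 = [Request(thread=1, bank=1, row=2, arrival=4)]
print(bank_pick(bank0, open_row=5).arrival)                           # 10
print(fr_fcfs({0: bank0, 1: bank1}, {0: 5, 1: None}).arrival)         # 4
```

Note that the request that arrived at time 0 is not selected at either level, even though it is the oldest request in the system.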


2.3 Vulnerability of the Multi-Core DRAM 
Memory System to DoS Attacks 


As described above, current DRAM memory systems do 
not distinguish between the requests of different threads 
(i.e. cores).⁴ Therefore, multi-core systems are
vulnerable to DoS attacks that exploit unfairness in the memory
system. Requests from a thread with a particular access 
pattern can get prioritized by the memory access sched- 
uler over requests from other threads, thereby causing 
the other threads to experience very long delays. We find 
that there are two major reasons why one thread can deny 
service to another in current DRAM memory systems: 


1. Unfairness of row-hit-first scheduling: A thread 
whose accesses result in row hits gets higher priority 
compared to a thread whose accesses result in row 
conflicts. We say that an access pattern that mainly
results in row hits has high row-buffer locality.
Thus, an application that has high row-buffer
locality (e.g. one that is streaming through memory) 
can significantly delay another application with low 
row-buffer locality if they happen to be accessing the 
same DRAM banks. 

2. Unfairness of oldest-first scheduling: Oldest-first 
scheduling implicitly gives higher priority to those 
threads that can generate memory requests at a faster 
rate than others. Such aggressive threads can flood 
the memory system with requests at a faster rate than 
the memory system can service. As such, aggres- 
sive threads can fill the memory system’s buffers with 
their requests, while less memory-intensive threads 
are blocked from the memory system until all the 
earlier-arriving requests from the aggressive threads 
are serviced. 


Based on this understanding, it is possible to develop a 
memory performance hog that effectively denies service 
to other threads. In the next section, we describe an ex- 
ample MPH and show its impact on another application. 


3 Motivation: Examples of Denial of Mem- 
ory Service in Existing Multi-Cores 


In this section, we present measurements from real sys- 
tems to demonstrate that Denial of Memory Service at- 
tacks are possible in existing multi-core systems. 


3.1 Applications 


We consider two applications to motivate the problem. 
One is a modified version of the popular stream bench- 
mark [21], an application that streams through memory 
and performs operations on two one-dimensional arrays. 
The arrays in stream are sized such that they are much
larger than the L2 cache on a core. Each array consists of
2.5M 128-byte elements.⁵ Stream (Figure 3(a)) has very
high row-buffer locality since consecutive cache misses
almost always access the same row (limited only by the
size of the row-buffer). Even though we cannot directly
measure the row-buffer hit rate in our real experimental
system (because hardware does not directly provide this
information), our simulations show that 96% of all
memory requests in stream result in row-hits.

⁴ We assume, without loss of generality, one core can execute one
thread.

// initialize arrays a, b
for (j=0; j<N; j++)
    index[j] = j;        // streaming index
...
for (j=0; j<N; j++)
    a[index[j]] = b[index[j]];
for (j=0; j<N; j++)
    b[index[j]] = scalar * a[index[j]];

(a) STREAM

// initialize arrays a, b
for (j=0; j<N; j++)
    index[j] = rand();   // random # in [0,N]
...
for (j=0; j<N; j++)
    a[index[j]] = b[index[j]];
for (j=0; j<N; j++)
    b[index[j]] = scalar * a[index[j]];

(b) RDARRAY

Figure 3: Major loops of the stream (a) and rdarray (b) programs

The other application, called rdarray, is almost the ex- 
act opposite of stream in terms of its row-buffer locality. 
Its pseudo-code is shown in Figure 3(b). Although it per- 
forms the same operations on two very large arrays (each 
consisting of 2.5M 128-byte elements), rdarray accesses 
the arrays in a pseudo-random fashion. The array indices 
accessed in each iteration of the benchmark’s main loop 
are determined using a pseudo-random number genera- 
tor. Consequently, this benchmark has very low row- 
buffer locality; the likelihood that any two outstanding 
L2 cache misses in the memory request buffer are to the 
same row in a bank is low due to the pseudo-random gen- 
eration of array indices. Our simulations show that 97% 
of all requests in rdarray result in row-conflicts. 
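
The contrast in row-buffer locality between the two access patterns can be illustrated with a toy model. This is only a sketch under strong assumptions: a single bank with one row-buffer, an 8KB row (the row size quoted later in the text), no caches and no interleaving with other threads. The function name `row_hit_rate` is hypothetical, and the rates it produces are not the paper's simulated 96% and 97% figures.

```python
import random

ELEMENT_BYTES = 128   # element size used by stream and rdarray
ROW_BYTES = 8 * 1024  # assumed row size


def row_hit_rate(indices):
    """Fraction of accesses that fall in the same row as the previous access."""
    hits = 0
    open_row = None
    for i in indices:
        row = (i * ELEMENT_BYTES) // ROW_BYTES
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(indices)


N = 64_000  # far smaller than the paper's 2.5M elements, to keep the sketch fast
streaming = list(range(N))                        # index[j] = j
rng = random.Random(0)
scattered = [rng.randrange(N) for _ in range(N)]  # index[j] = random

stream_rate = row_hit_rate(streaming)
rdarray_rate = row_hit_rate(scattered)
print(f"streaming: {stream_rate:.3f}, random: {rdarray_rate:.3f}")
```

With 64 elements per row, the streaming pattern misses only once per row, while the random pattern almost never revisits the open row.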


3.2 Measurements 


We ran the two applications alone and together on two 
existing multi-core systems and one simulated future 
multi-core system. 


3.2.1 A Dual-core System 


The first system we examine is an Intel Pentium D
930 [17] based dual-core system with 2GB SDRAM.
In this system each core has an L2 cache size of 2MB.
Only the DRAM memory system is shared between the
two cores. The operating system is Windows XP
Professional.⁶ All the experiments were performed when
the systems were unloaded as much as possible. To
account for possible variability due to system state, each
run was repeated 10 times and the execution time results
were averaged (error bars show the variance across the
repeated runs). Each application’s main loop consists of
$N = 2.5 \cdot 10^6$ iterations and was repeated 1000 times in
the measurements.

⁵ Even though the elements are 128-byte, each iteration of the main
loop operates on only one 4-byte integer in the 128-byte element. We
use 128-byte elements to ensure that consecutive accesses miss in the
cache and exercise the DRAM memory system.

⁶ We also repeated the same experiments in (1) the same system with
the RedHat Fedora Core 6 operating system and (2) an Intel Core Duo
based dual-core system running RedHat Fedora Core 6. We found the
results to be almost exactly the same as those reported.

Figure 4(a) shows the normalized execution time of 
stream when run (1) alone, (2) concurrently with another 
copy of stream, and (3) concurrently with rdarray. Fig- 
ure 4(b) shows the normalized execution time of rdarray 
when run (1) alone, (2) concurrently with another copy 
of rdarray, and (3) concurrently with stream. 

When stream and rdarray execute concurrently on the 
two different cores, stream is slowed down by only 18%. 
In contrast, rdarray experiences a dramatic slowdown: 
its execution time increases by up to 190%. Hence, 
stream effectively denies memory service to rdarray 
without being significantly slowed down itself. 

We hypothesize that this behavior is due to the row- 
hit-first scheduling policy in the DRAM memory con- 
troller. As most of stream’s memory requests hit in 
the row-buffer, they are prioritized over rdarray’s re- 
quests, most of which result in row conflicts. Conse- 
quently, rdarray is denied access to the DRAM banks 
that are being accessed by stream until the stream pro- 
gram’s access pattern moves on to another bank. With 
a row size of 8KB and a cache line size of 64B, 128 
(=8KB/64B) of stream’s memory requests can be serviced
by a DRAM bank before rdarray is allowed to access
that bank!⁷ Thus, due to the thread-unfair implementation
of the DRAM memory system, stream can act
as an MPH against rdarray. 
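
The figure of 128 requests can be reproduced with a small sketch of one bank under row-hit-first scheduling. The assumptions are loud ones: the function name `served_before_victim` is hypothetical, all of the streaming thread's requests to the open row are taken to be queued already, and the victim's single request arrived before any of them.

```python
ROW_BYTES = 8 * 1024  # row size quoted in the text
LINE_BYTES = 64       # cache line size quoted in the text


def served_before_victim(row_bytes: int = ROW_BYTES,
                         line_bytes: int = LINE_BYTES) -> int:
    """Count row-hit requests served before one older row-conflict request."""
    open_row = 0
    # Entries are (arrival, thread, row); the victim is oldest but targets row 1.
    queue = [(0, "rdarray", 1)]
    queue += [(t, "stream", open_row)
              for t in range(1, row_bytes // line_bytes + 1)]
    served = 0
    while queue:
        # Row hits first, then oldest first.
        req = min(queue, key=lambda r: (r[2] != open_row, r[0]))
        queue.remove(req)
        if req[1] == "rdarray":
            return served
        served += 1
    return served


print(served_before_victim())  # 128
```

The oldest request in the queue is served last, which is the starvation effect hypothesized above.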

Note that the slowdown rdarray experiences when run
with stream (2.90X) is much greater than the slowdown
it experiences when run with another copy of rdarray
(1.71X). Because neither copy of rdarray has good
row-buffer locality, another copy of rdarray cannot deny
service to rdarray by holding on to a row-buffer for a long
time. In this case, the performance loss comes from
increased bank conflicts and contention in the DRAM bus.

⁷ Note that we do not know the exact details of the DRAM memory
controller and scheduling algorithm that is implemented in the
existing systems. These details are not made public in either Intel’s or
AMD’s documentation. Therefore, we hypothesize about the causes of
the behavior based on public information available on DRAM memory
systems, and later support our hypotheses with our simulation
infrastructure (see Section 6). It could be possible that existing systems have
a threshold up to which younger requests can be ordered over older
requests as described in a patent [33], but even so our experiments
suggest that memory performance attacks are still possible in existing
multi-core systems.

Figure 4: Normalized execution time of (a) stream and (b) rdarray when run alone/together on a dual-core system
(bars in each panel: alone, with another copy of the same application, with the other application)

On the other hand, the slowdown stream experiences 
when run with rdarray is significantly smaller than the 
slowdown it experiences when run with another copy of 
stream. When two copies of stream run together they are 
both able to deny access to each other because they both 
have very high row-buffer locality. Because the rates at 
which both streams generate memory requests are the 
same, the slowdown is not as high as rdarray’s slowdown 
with stream: copies of stream take turns in denying ac- 
cess to each other (in different DRAM banks) whereas 
stream always denies access to rdarray (in all DRAM 
banks). 


3.2.2 A Dual Dual-core System 


The second system we examine is a dual dual-core AMD 
Opteron 275 [1] system with 4GB SDRAM. In this sys- 
tem, only the DRAM memory system is shared between 
a total of four cores. Each core has an L2 cache size 
of 1 MB. The operating system used was RedHat Fe- 
dora Core 5. Figure 5(a) shows the normalized execution 
time of stream when run (1) alone, (2) with one copy of 
rdarray, (3) with 2 copies of rdarray, (4) with 3 copies 
of rdarray, and (5) with 3 other copies of stream. Fig- 
ure 5(b) shows the normalized execution time of rdarray 
in similar but “dual” setups. 

Similar to the results shown for the dual-core Intel sys- 
tem, the performance of rdarray degrades much more 
significantly than the performance of stream when the 
two applications are executed together on the 4-core 
AMD system. In fact, stream slows down by only 48% 
when it is executed concurrently with 3 copies of rdar- 
ray. In contrast, rdarray slows down by 408% when run- 
ning concurrently with 3 copies of stream. Again, we hy- 
pothesize that this difference in slowdowns is due to the 
row-hit-first policy employed in the DRAM controller. 


3.2.3 A Simulated 16-core System 


While the problem of MPHs is severe even in current 
dual- or dual-dual-core systems, it will be significantly 
aggravated in future multi-core systems consisting of 
many more cores. To demonstrate the severity of the 
problem, Figure 6 shows the normalized execution time 
of stream and rdarray when run concurrently with 15 
copies of stream or 15 copies of rdarray, along with 
their normalized execution times when 8 copies of each 
application are run together. Note that our simulation 
methodology and simulator parameters are described in 
Section 6.1. In a 16-core system, our memory perfor- 
mance hog, stream, slows down rdarray by 14.6X while 
rdarray slows down stream by only 4.4X. Hence, stream 
is an even more effective performance hog in a 16-core 
system, indicating that the problem of “memory perfor- 
mance attacks” will become more severe in the future if 
the memory system is not adjusted to prevent them. 


4 Towards a Solution: Fairness in DRAM 
Memory Systems 


The fundamental unifying cause of the attacks demon- 
strated in the previous section is unfairness in the shared 
DRAM memory system. The problem is that the mem- 
ory system cannot distinguish whether a harmful mem- 
ory access pattern issued by a thread is due to a malicious 
attack, due to erroneous programming, or simply a nec- 
essary memory behavior of a specific application. There- 
fore, the best the DRAM memory scheduler can do is to 
contain and limit memory attacks by providing fairness 
among different threads. 

Difficulty of Defining DRAM Fairness: But what ex- 
actly constitutes fairness in DRAM memory systems? 
As it turns out, answering this question is non-trivial 
and coming up with a reasonable definition is somewhat 
problematic. For instance, simple algorithms that sched- 
ule requests in such a way that memory latencies are 
equally distributed among different threads disregard the 
fact that different threads have different amounts of row- 
buffer locality. As a consequence, such equal-latency 
scheduling algorithms will unduly slow down threads
that have high row-buffer locality and prioritize threads
that have poor row-buffer locality. Whereas the standard
FR-FCFS scheduling algorithm can starve threads with
poor row-buffer locality (Section 2.3), any algorithm
seeking egalitarian memory fairness would unfairly
punish “well-behaving” threads with good row-buffer
locality. Neither of the two options therefore rules out
unfairness and the possibility of memory attacks.

Figure 5: Slowdown of (a) stream and (b) rdarray when run alone/together on a dual dual-core system

Figure 6: Slowdown of (a) stream and (b) rdarray when run alone and together on a simulated 16-core system


Another challenge is that DRAM memory systems 
have a notion of state (consisting of the currently 
buffered rows in each bank). For this reason, well- 
studied notions of fairness that deal with stateless sys- 
tems cannot be applied in our setting. In network fair 
queuing [24, 40, 3], for example, the idea is that if N pro- 
cesses share a common channel with bandwidth B, every 
process should achieve exactly the same performance as 
if it had a single channel of bandwidth B/N. When map- 
ping the same notion of fairness onto a DRAM memory 
system (as done in [23]), however, the memory sched- 
uler would need to schedule requests in such a way as 
to guarantee the following: In a multi-core system with 
N threads, no thread should run slower than the same 
thread on a single-core system with a DRAM memory 
system that runs at 1/Nth of the speed. Unfortunately, 
because memory banks have state and row conflicts incur 
a higher latency than row hit accesses, this notion of fair- 
ness is ill-defined. Consider for instance two threads in 
a dual-core system that constantly access the same bank 
but different rows. While each of these threads by itself 
has perfect row-buffer locality, running them together 
will inevitably result in row-buffer conflicts. Hence, it 
is impossible to schedule these threads in such a way
that each thread runs at the same speed as if it ran by
itself on a system at half the speed. On the other hand, 
requests from two threads that consistently access differ- 
ent banks could (almost) entirely be scheduled in parallel 
and there is no reason why the memory scheduler should 
be allowed to slow these threads down by a factor of 2. 


In summary, in the context of memory systems, no- 
tions of fairness—such as network fair queuing—that at- 
tempt to equalize the latencies experienced by different 
threads are unsuitable. In a DRAM memory system, it 
is neither possible to achieve such fairness nor would
achieving it significantly reduce the risk of memory per- 
formance attacks. In Section 4.1, we will present a novel 
definition of DRAM fairness that takes into account the 
inherent row-buffer locality of threads and attempts to 
balance the “relative slowdowns”. 


The Idleness Problem: In addition to the above ob- 
servations, it is important to observe that any scheme 
that tries to balance latencies between threads runs into 
the risk of what we call the idleness problem. Threads 
that are temporarily idle (not issuing many memory re- 
quests, for instance due to a computation-intensive pro- 
gram phase) will be slowed down when returning to a 
more memory intensive access pattern. On the other 
hand, in certain solutions based on network fair queu- 
ing [23], a memory hog could intentionally issue no or 
few memory requests for a period of time. During that 
time, other threads could “move ahead” at a proportion- 
ally lower latency, such that, when the malicious thread 
returns to an intensive access pattern, it is temporarily 
prioritized and normal threads are blocked. The idleness 
problem therefore poses a severe security risk: By
exploiting it, an attacking memory hog could temporarily
slow down or even block time-critical applications with 
high performance stability requirements from memory. 


4.1 Fair Memory Scheduling: A Model 


As discussed, standard notions of fairness fail to
provide fair execution, and hence security, when mapped
onto shared memory systems. The crucial insight
that leads to a better notion of fairness is that we need 
to dissect the memory latency experienced by a thread 
into two parts: First, the latency that is inherent to the 
thread itself (depending on its row-buffer locality) and 
second, the latency that is caused by contention with 
other threads in the shared DRAM memory system. A 
fair memory system should—unlike the approaches so 
far—schedule requests in such a way that the second la- 
tency component is fairly distributed, while the first com- 
ponent remains untouched. With this, it is clear why our 
novel notion of DRAM shared memory fairness is based 
on the following intuition: In a multi-core system with 
N threads, no thread should suffer more relative perfor- 
mance slowdown—compared to the performance it gets 
if it used the same memory system by itself—than any 
other thread. Because each thread’s slowdown is thus 
measured against its own baseline performance (single 
execution on the same system), this notion of fairness 
successfully dissects the two components of latency and 
takes into account the inherent characteristics of each 
thread. 

In more technical terms, we consider a measure $\chi_i$ for
each currently executed thread $i$.⁸ This measure captures
the price (in terms of relative additional latency) a thread
$i$ pays because the shared memory system is used by
multiple threads in parallel in a multi-core architecture. In
order to provide fairness and contain the risk of denial of
memory service attacks, the memory controller should
schedule outstanding requests in the buffer in such a way
that the $\chi_i$ values are as balanced as possible. Such a
schedule will ensure that each thread only suffers a fair
amount of additional latency that is caused by the parallel
usage of the shared memory system.

Formal Definition: Our definition of the measure $\chi_i$
is based on the notion of cumulated bank-latency $L_{i,b}$
that we define as follows.


Definition 4.1. For each thread $i$ and bank $b$, the
cumulated bank-latency $L_{i,b}$ is the number of memory cycles
during which there exists an outstanding memory request
by thread $i$ for bank $b$ in the memory request buffer. The
cumulated latency of a thread $L_i = \sum_b L_{i,b}$ is the sum
of all cumulated bank-latencies of thread $i$.


⁸ The DRAM memory system only keeps track of threads that are
currently issuing requests.


The motivation for this formulation of $L_{i,b}$ is best seen
when considering latencies on the level of individual
memory requests. Consider a thread $i$ and let $R^k_{i,b}$ denote
the $k$th memory request of thread $i$ that accesses bank $b$.
Each such request $R^k_{i,b}$ is associated with three specific
times: its arrival time $a^k_{i,b}$, when it is entered into the
request buffer; its finish time $f^k_{i,b}$, when it is completely
serviced by the bank and sent to processor $i$’s cache; and
finally, the request’s activation time

$$s^k_{i,b} := \max\{f^{k-1}_{i,b},\; a^k_{i,b}\}.$$

This is the earliest time when request $R^k_{i,b}$ could be
scheduled by the bank scheduler. It is the larger of its
arrival time and the finish time of the previous request
$R^{k-1}_{i,b}$ that was issued by the same thread to the same
bank. A request’s activation time marks the point in time
from which on $R^k_{i,b}$ is responsible for the ensuing latency
of thread $i$; before $s^k_{i,b}$, the request was either not sent
to the memory system or an earlier request to the same
bank by the same thread was generating the latency. With
these definitions, the amortized latency $\ell^k_{i,b}$ of request
$R^k_{i,b}$ is the difference between its finish time and its
activation time, i.e., $\ell^k_{i,b} = f^k_{i,b} - s^k_{i,b}$. By the definition of the
activation time $s^k_{i,b}$, it is clear that at any point in time,
the amortized latency of exactly one outstanding request
is increasing (if there is at least one in the request buffer).
Hence, when describing time in terms of executed
memory cycles, our definition of cumulated bank-latency $L_{i,b}$
corresponds exactly to the sum over all amortized
latencies to this bank, i.e., $L_{i,b} = \sum_k \ell^k_{i,b}$.
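
The relationship between activation times, amortized latencies and the cumulated bank-latency can be checked with a short sketch. It assumes that one thread's requests to one bank are served in issue order, that times are non-negative memory cycles, and that there is no request before the first one (so the previous finish time starts at 0). The function name `cumulated_bank_latency` is hypothetical.

```python
from typing import List, Tuple


def cumulated_bank_latency(requests: List[Tuple[int, int]]) -> int:
    """Sum the amortized latencies of one thread's requests to one bank.

    `requests` holds (arrival, finish) pairs in issue order, in memory cycles.
    """
    total = 0
    prev_finish = 0  # assumption: no earlier request exists
    for arrival, finish in requests:
        activation = max(prev_finish, arrival)  # larger of previous finish and arrival
        total += finish - activation            # amortized latency of this request
        prev_finish = finish
    return total


# Requests are outstanding during cycles [0, 70) and [100, 130).
reqs = [(0, 40), (10, 70), (100, 130)]
print(cumulated_bank_latency(reqs))  # 100
```

The result, 40 + 30 + 30 = 100, equals the number of cycles during which at least one of the requests is outstanding, as the definition requires.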

In order to compute the experienced slowdown of each
thread, we compare the actual experienced cumulated
latency $L_i$ of each thread $i$ to an imaginary, ideal
single-core cumulated latency $\tilde{L}_i$ that serves as a baseline. This
latency $\tilde{L}_i$ is the minimal cumulated latency that thread
$i$ would have accrued if it had run as the only thread in
the system using the same DRAM memory; it captures
the latency component of $L_i$ that is inherent to the thread
itself and not caused by contention with other threads.
Hence, threads with good and bad row-buffer locality
have small and large $\tilde{L}_i$, respectively. The measure $\chi_i$
that captures the relative slowdown of thread $i$ caused by
multi-core parallelism can now be defined as follows.


Definition 4.2. For a thread $i$, the DRAM memory
slowdown index $\chi_i$ is the ratio between its cumulated latency
$L_i$ and its ideal single-core cumulated latency $\tilde{L}_i$:⁹

$$\chi_i := \frac{L_i}{\tilde{L}_i}$$


⁹ Notice that our definitions do not take into account the service and
waiting times of the shared DRAM bus and across-bank scheduling.
Both our definition of fairness as well as our algorithm presented in
Section 5 can be extended to take into account these and other more
subtle hardware issues. As the main goal of this paper is to point out and
investigate potential security risks caused by DRAM unfairness, our
model abstracts away numerous aspects of secondary importance
because our definition provides a good approximation.
Finally, we define the DRAM unfairness $\Psi$ of a
DRAM memory system as the ratio between the
maximum and minimum slowdown index over all currently
executed threads in the system:

$$\Psi := \frac{\max_i \chi_i}{\min_j \chi_j}$$

The “ideal” DRAM unfairness index $\Psi = 1$ is achieved
if all threads experience exactly the same slowdown; the
higher $\Psi$, the more unbalanced is the experienced
slowdown of different threads. The goal of a fair memory
access scheduling algorithm is therefore to achieve a $\Psi$ that
is as close to 1 as possible. This ensures that no thread
is over-proportionally slowed down due to the shared
nature of DRAM memory in multi-core systems.
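
As a small worked example of the two definitions, the slowdown index and the DRAM unfairness can be computed directly from cumulated latencies. The latency values below are made up for illustration, and the function names `slowdown_index` and `unfairness` are hypothetical.

```python
from typing import Dict


def slowdown_index(cumulated: float, ideal_single_core: float) -> float:
    """Ratio of a thread's cumulated latency to its ideal single-core latency."""
    return cumulated / ideal_single_core


def unfairness(chi: Dict[str, float]) -> float:
    """Ratio of the largest to the smallest slowdown index."""
    return max(chi.values()) / min(chi.values())


# Hypothetical cumulated latencies in memory cycles (shared, alone).
chi = {
    "A": slowdown_index(4000, 1000),   # thread A is slowed down 4x
    "B": slowdown_index(10000, 5000),  # thread B is slowed down 2x
}
print(chi)              # {'A': 4.0, 'B': 2.0}
print(unfairness(chi))  # 2.0
```

Thread B accrues more latency in absolute terms, but thread A suffers the larger relative slowdown, so it is thread A that a fair scheduler would favor.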

Notice that by taking into account the different row- 
buffer localities of different threads, our definition of 
DRAM unfairness prevents punishing threads for hav- 
ing either good or bad memory access behavior. Hence, 
a scheduling algorithm that achieves low DRAM un- 
fairness mitigates the risk that any thread in the sys- 
tem, regardless of its bank and row access pattern, is 
unduly bogged down by other threads. Notice further
that DRAM unfairness is virtually unaffected by the
idleness problem, because both cumulated latencies $L_i$ and
ideal single-core cumulated latencies $\tilde{L}_i$ are only accrued
when there are requests in the memory request buffer.

Short-Term vs. Long-Term Fairness: So far, the
aspect of time-scale has remained unspecified in our
definition of DRAM unfairness. Both $L_i$ and $\tilde{L}_i$ continue
to increase throughout the lifetime of a thread.
Consequently, a short-term unfair treatment of a thread would
have increasingly little impact on its slowdown index
$\chi_i$. While long-term fairness would still be provided, threads
that have been running for a long time could become
vulnerable to short-term DoS attacks even if the scheduling
algorithm enforced an upper bound on DRAM unfairness $\Psi$.
In this way, delay-sensitive applications could be blocked
from DRAM memory for limited periods of time.

We therefore generalize all our definitions to include
an additional parameter $T$ that denotes the time-scale for
which the definitions apply. In particular, $L_i(T)$ and
$\tilde{L}_i(T)$ are the maximum (ideal single-core) cumulated
latencies over all time-intervals of duration $T$ during
which thread $i$ is active. Similarly, $\chi_i(T)$ and $\Psi(T)$ are
defined as the maximum values over all time-intervals
of length $T$. The parameter $T$ in these definitions
determines how short- or long-term the considered fairness is.
In particular, a memory scheduling algorithm with good
long term fairness will have small $\Psi(T)$ for large $T$, but
possibly large $\Psi(T')$ for smaller $T'$. In view of the
security issues raised in this paper, it is clear that a
memory scheduling algorithm should aim at achieving small
$\Psi(T)$ for both small and large $T$.


5 Our Solution 


In this section, we propose FairMem, a new fair memory 
scheduling algorithm that achieves good fairness accord- 
ing to the definition in Section 4 and hence, reduces the 
risk of memory-related DoS attacks. 


5.1 Basic Idea 


The reason why MPHs can exist in multi-core systems 
is the unfairness in current memory access schedulers. 
Therefore, the idea of our new scheduling algorithm is 
to enforce fairness by balancing the relative memory- 
related slowdowns experienced by different threads. The 
algorithm schedules requests in such a way that each 
thread experiences a similar degree of memory-related 
slowdown relative to its performance when run alone. 

In order to achieve this goal, the algorithm maintains
a value ($\chi_i$ in our model of Section 4.1) that
characterizes the relative slowdown of each thread. As long as all
threads have roughly the same slowdown, the algorithm
schedules requests using the regular FR-FCFS
mechanism. When the slowdowns of different threads start
diverging and the difference exceeds a certain threshold
(i.e., when $\Psi$ becomes too large), however, the
algorithm switches to an alternative scheduling mechanism
and starts prioritizing requests issued by threads
experiencing large slowdowns.


5.2 Fair Memory Scheduling Algorithm
(FairMem) 


The memory scheduling algorithm we propose for use
in DRAM controllers for multi-core systems is defined
by means of two input parameters, $\alpha$ and $\beta$. These
parameters can be used to fine-tune the involved trade-offs
between fairness and throughput on the one hand ($\alpha$)
and short-term versus long-term fairness on the other
($\beta$). More concretely, $\alpha$ is a parameter that expresses
to what extent the scheduler is allowed to optimize for
DRAM throughput at the cost of fairness, i.e., how much
DRAM unfairness is tolerable. The parameter $\beta$
corresponds to the time-interval $T$ that denotes the time-scale
of the above fairness condition. In particular, the
memory controller divides time into windows of duration $\beta$
and, for each thread, maintains an accurate account of
its accumulated latencies $L_i(\beta)$ and $\tilde{L}_i(\beta)$ in the current
time window.¹⁰


¹⁰ Notice that in principle, there are various possibilities of interpreting
the term “current time window.” The simplest way is to completely
reset $L_i(\beta)$ and $\tilde{L}_i(\beta)$ after each completion of a window. More
sophisticated techniques could include maintaining multiple, say $k$, such
windows of size $\beta$ in parallel, each shifted in time by $\beta/k$ memory
cycles. In this case, all windows are constantly updated, but only the
oldest is used for the purpose of decision-making. This could help in
reducing volatility.
Instead of using the FR-FCFS algorithm described
in Section 2.2.3, our algorithm first determines two
candidate requests from each bank $b$, one according to each
of the following rules:


• Highest FR-FCFS priority: Let $R_{\text{FR-FCFS}}$ be the
request to bank $b$ that has the highest priority according
to the FR-FCFS scheduling policy of Section 2.2.3.
That is, row hits have higher priority than row conflicts,
and, given this partial ordering, the oldest request
is served first.

• Highest fairness-index: Let $i'$ be the thread with
highest current DRAM memory slowdown index
$\chi_{i'}(\beta)$ that has at least one outstanding request in the
memory request buffer to bank $b$. Among all requests
to $b$ issued by $i'$, let $R_{\text{Fair}}$ be the one with highest
FR-FCFS priority.


Between these two candidates, the algorithm chooses the 
request to be scheduled based on the following rule: 


• Fairness-oriented Selection: Let $\chi_\ell(\beta)$ and $\chi_s(\beta)$
denote the largest and smallest DRAM memory slowdown
index of any request in the memory request
buffer for a current time window of duration $\beta$. If
it holds that

$$\frac{\chi_\ell(\beta)}{\chi_s(\beta)} \geq \alpha,$$

then $R_{\text{Fair}}$ is selected by bank $b$’s scheduler, and
$R_{\text{FR-FCFS}}$ otherwise.





Instead of using the oldest-across-banks-first strategy as 
used in current DRAM memory schedulers, selection 
from requests chosen by the bank schedulers is handled 
as follows: 

• Highest-DRAM-fairness-index-first across banks:
The request with highest slowdown index $\chi_i(\beta)$ among
all selected bank-requests is sent on the shared DRAM
bus.
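The selection rules above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the hardware design: the names `Request`, `fr_fcfs_best`, and `schedule` are hypothetical, the slowdown indices are assumed to be supplied as a dictionary `chi` with positive values, and ties are broken arbitrarily.

```python
from dataclasses import dataclass


@dataclass
class Request:
    thread: int   # id of the issuing thread
    bank: int     # target DRAM bank
    row: int      # target row within the bank
    arrival: int  # arrival time in memory cycles (smaller = older)


def fr_fcfs_best(requests, open_row):
    """Highest FR-FCFS priority: row hits first, then the oldest request."""
    return min(requests, key=lambda r: (r.row != open_row, r.arrival))


def schedule(buffer, open_rows, chi, alpha):
    """Pick the next request to send on the shared DRAM bus.

    buffer:    list of buffered Request objects (all banks)
    open_rows: dict bank -> row currently held in that bank's row-buffer
    chi:       dict thread -> current slowdown index (assumed > 0)
    alpha:     unfairness threshold
    """
    if not buffer:
        return None
    active = {r.thread for r in buffer}
    chi_l = max(chi[t] for t in active)  # largest slowdown index
    chi_s = min(chi[t] for t in active)  # smallest slowdown index
    use_fair = chi_l / chi_s >= alpha    # fairness-oriented selection

    winners = []
    for bank in sorted({r.bank for r in buffer}):
        reqs = [r for r in buffer if r.bank == bank]
        if use_fair:
            # Restrict to the most slowed-down thread with a request to this bank.
            slowest = max({r.thread for r in reqs}, key=lambda t: chi[t])
            reqs = [r for r in reqs if r.thread == slowest]
        winners.append(fr_fcfs_best(reqs, open_rows.get(bank)))

    # Across banks: the request of the thread with the highest slowdown index wins.
    return max(winners, key=lambda r: chi[r.thread])
```

With equal slowdown indices the sketch behaves like plain FR-FCFS within a bank; once the ratio of the largest to the smallest index reaches `alpha`, the most slowed-down thread's request is chosen even if it is a row conflict.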

In principle, the algorithm is built to ensure that at no
time DRAM unfairness $\Psi(\beta)$ exceeds the parameter $\alpha$.
Whenever there is the risk of exceeding this threshold,
the memory controller will switch to a mode in which it
starts prioritizing threads with higher $\chi_i$ values, which
decreases $\chi_i$. It also increases the $\chi_j$ values of threads
that have had little slowdown so far. Consequently, this
strategy balances large and small slowdowns, which
decreases DRAM unfairness and, as shown in Section 6,
keeps potential memory-related DoS attacks in check.

Notice that this algorithm does not (in fact, cannot)
guarantee that the DRAM unfairness $\Psi$ stays below
the predetermined threshold $\alpha$ at all times. The
impossibility of this can be seen when considering the corner
case $\alpha = 1$. In this case, a violation occurs after the
first request regardless of which request is scheduled by
the algorithm. On the other hand, the algorithm always
attempts to keep the necessary violations to a minimum.


Another advantage of our scheme is that an approxi- 
mate version of it lends itself to efficient implementation 
in hardware. Finally, notice that our algorithm is robust 
with regard to the idleness problem mentioned in Section 4.
In particular, neither $L_i$ nor $\tilde{L}_i$ is increased or
decreased if a thread has no outstanding memory requests
in the request buffer. Hence, not issuing any requests for 
some period of time (either intentionally or unintention- 
ally) does not affect this or any other thread’s priority in 
the buffer. 


5.3. Hardware Implementations 


The algorithm as described so far is abstract in the sense 
that it assumes a memory controller that always has full 
knowledge of every active (currently-executed) thread’s 
$L_i$ and $\tilde{L}_i$. In this section, we show how this exact
scheme could be implemented, and we also briefly dis- 
cuss a more efficient practical hardware implementation. 

Exact Implementation: Theoretically, it is possible
to ensure that the memory controller always keeps accurate
information of $L_i(\beta)$ and $\tilde{L}_i(\beta)$. Keeping track of
$L_i(\beta)$ for each thread is simple. For each active thread,
a counter maintains the number of memory cycles during
which at least one request of this thread is buffered
for each bank. After completion of the window $\beta$ (or
when a new thread is scheduled on a core), counters are
reset. The more difficult part of maintaining an accurate
account of $\tilde{L}_i(\beta)$ can be done as follows: At all times,
maintain for each active thread $i$ and for each bank the
row that would currently be in the row-buffer if $i$ had
been the only thread using the DRAM memory system.
This can be done by simulating an FR-FCFS priority
scheme for each thread and bank that ignores all requests
issued by threads other than $i$. The latency $\tilde{\ell}^k_{i,b}$ of each
request $R^k_{i,b}$ then corresponds to the latency this request
would have caused if DRAM memory was not shared.
Whenever a request is served, the memory controller can
add this “ideal latency” to the corresponding $\tilde{L}_{i,b}(\beta)$ of
that thread and, if necessary, update the simulated state
of the row-buffer accordingly. For instance, assume that
a request $R^k_{i,b}$ is served, but results in a row conflict.
Assume further that the same request would have been a
row hit if thread $i$ had run by itself, i.e., $R^{k-1}_{i,b}$ accesses
the same row as $R^k_{i,b}$. In this case, $\tilde{L}_{i,b}(\beta)$ is increased
by the row-hit latency $T_{\text{hit}}$, whereas $L_{i,b}(\beta)$ is increased by
the bank-conflict latency $T_{\text{conf}}$. By thus “simulating”
its own execution for each thread, the memory controller
obtains accurate information for all $\tilde{L}_{i,b}(\beta)$.
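The bookkeeping described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions rather than the exact scheme: the class `LatencyTracker` and its method names are hypothetical, each thread's requests are assumed to be served alone in the same order as in the shared system (so no per-thread FR-FCFS reordering is simulated), the experienced latency is added per served request as in the example, and the cycle counts are the row-hit, closed, and conflict latencies listed in Table 1.

```python
from collections import defaultdict

# Round-trip latencies in processor cycles, taken from Table 1.
T_HIT, T_CLOSED, T_CONF = 200, 300, 400


class LatencyTracker:
    """Tracks experienced and ideal (run-alone) latency per thread."""

    def __init__(self):
        self.actual = defaultdict(int)  # latency experienced in the shared system
        self.ideal = defaultdict(int)   # latency the thread would see alone
        self.shadow_row = {}            # (thread, bank) -> row open if run alone

    def on_served(self, thread, bank, row, experienced):
        """Account for one served request of `thread` to (`bank`, `row`)."""
        self.actual[thread] += experienced
        prev = self.shadow_row.get((thread, bank))
        if prev is None:
            self.ideal[thread] += T_CLOSED  # bank untouched by this thread so far
        elif prev == row:
            self.ideal[thread] += T_HIT     # would have been a row hit alone
        else:
            self.ideal[thread] += T_CONF    # a conflict even when running alone
        self.shadow_row[(thread, bank)] = row

    def slowdown_index(self, thread):
        """Ratio of experienced to ideal latency (needs one served request)."""
        return self.actual[thread] / self.ideal[thread]

    def reset_window(self):
        """Reset the counters at a window boundary.

        Keeping the shadow row-buffer state across windows is an assumption.
        """
        self.actual.clear()
        self.ideal.clear()
```

For instance, if a thread's second request to the same row suffers a 400-cycle conflict in the shared system, only 200 cycles are added to its ideal latency, so its slowdown index rises above 1.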

The obvious problem with the above implementation
is that it is expensive in terms of hardware overhead.
It requires maintaining at least one counter for each
core×bank pair. Equally costly, it requires one divider
per core in order to compute the value
$\chi_i(\beta) = L_i(\beta)/\tilde{L}_i(\beta)$ for the thread that is currently running on
that core in every memory cycle. Fortunately, much
less expensive hardware implementations are possible
because the memory controller does not need to know
the exact values of $L_{i,b}$ and $\tilde{L}_{i,b}$ at any given moment.
Instead, using reasonably accurate approximate values
suffices to maintain an excellent level of fairness and
security.

Reduce counters by sampling: Using sampling techniques,
the number of counters that need to be maintained
can be reduced from O(#Banks × #Cores) to
O(#Cores) with only little loss in accuracy. Briefly, the
idea is the following. For each core and its active thread,
we keep two counters $S_i$ and $H_i$ denoting the number of
samples and sampled hits, respectively. Instead of keeping
track of the exact row that would be open in the row-buffer
if a thread $i$ was running alone, we randomly sample
a subset of requests $R^k_{i,b}$ issued by thread $i$ and check
whether the next request by $i$ to the same bank, $R^{k+1}_{i,b}$, is
for the same row. If so, the memory controller increases
both $S_i$ and $H_i$; otherwise, only $S_i$ is increased. Requests
$R^q_{i,b'}$ to different banks $b' \neq b$ served between $R^k_{i,b}$ and
$R^{k+1}_{i,b}$ are ignored. Finally, if none of the $Q$ requests of
thread $i$ following $R^k_{i,b}$ go to bank $b$, the sample is
discarded, neither $S_i$ nor $H_i$ is increased, and a new sample
request is taken. With this technique, the probability
$H_i/S_i$ that a request results in a row hit gives the memory
controller a reasonably accurate picture of each thread’s
row-buffer locality. An approximation of $\tilde{L}_i$ can thus be
maintained by adding the expected amortized latency to
it whenever a request is served, i.e.,

$$\tilde{L}_i^{\text{new}} = \tilde{L}_i^{\text{old}} + \left( \frac{H_i}{S_i} \cdot T_{\text{hit}} + \left(1 - \frac{H_i}{S_i}\right) \cdot T_{\text{conf}} \right).$$
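As a rough illustration of the sampling idea, the following Python sketch keeps one pair of counters per thread and adds the expected latency on every served request. The class name `HitRateSampler`, the `sample_prob` parameter, and the choice to assume a hit probability of 0 before any sample has completed are assumptions made for this sketch; the text does not specify them.

```python
import random


class HitRateSampler:
    """Sampled estimate of one thread's row-buffer locality."""

    def __init__(self, t_hit, t_conf, q, sample_prob=0.1, rng=None):
        self.t_hit = t_hit              # row-hit latency
        self.t_conf = t_conf            # row-conflict latency
        self.q = q                      # give up on a sample after q other requests
        self.sample_prob = sample_prob  # chance to start a sample (assumption)
        self.rng = rng or random.Random(0)
        self.samples = 0                # counter S
        self.hits = 0                   # counter H
        self.ideal = 0.0                # approximate ideal accumulated latency
        self.pending = None             # (bank, row, followers seen so far)

    def on_served(self, bank, row):
        """Update counters and the latency estimate for one served request."""
        if self.pending is not None:
            p_bank, p_row, seen = self.pending
            if bank == p_bank:
                # Next request to the sampled bank: the sample completes.
                self.samples += 1
                if row == p_row:
                    self.hits += 1
                self.pending = None
            elif seen + 1 >= self.q:
                self.pending = None     # discard; counters stay unchanged
            else:
                self.pending = (p_bank, p_row, seen + 1)
        if self.pending is None and self.rng.random() < self.sample_prob:
            self.pending = (bank, row, 0)  # start a new sample
        # Expected amortized latency, using the current hit-rate estimate.
        p_hit = self.hits / self.samples if self.samples else 0.0
        self.ideal += p_hit * self.t_hit + (1 - p_hit) * self.t_conf
```

Only two counters and one pending sample are kept per thread, independent of the number of banks, which is the point of the reduction from O(#Banks × #Cores) to O(#Cores).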


Reuse dividers: The ideal scheme employs 
O(#Cores) hardware dividers, which significantly 
increases the memory controller’s energy consumption. 
Instead, a single divider can be used for all cores by 
assigning individual threads to it in a round robin 
fashion. That is, while the accumulated latencies $L_i(\beta)$ and $\tilde{L}_i(\beta)$
can be updated in every memory cycle, their quotient
$\chi_i(\beta)$ is recomputed only at intervals.
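A minimal Python sketch of the divider-sharing idea follows. The class `SharedDivider` is hypothetical, and the initial cached value of 1.0 (no slowdown) is an assumption; the sketch only shows that each step refreshes one thread's quotient while the others keep their last computed value.

```python
class SharedDivider:
    """One divider shared by all threads in round-robin order."""

    def __init__(self, num_threads):
        self.chi = [1.0] * num_threads  # cached slowdown indices (assumed start)
        self.next = 0                   # thread whose quotient is refreshed next

    def tick(self, actual, ideal):
        """Recompute the quotient for one thread; return that thread's id.

        actual, ideal: per-thread accumulated latencies, indexed by thread id.
        """
        i = self.next
        if ideal[i] > 0:                # skip threads without served requests
            self.chi[i] = actual[i] / ideal[i]
        self.next = (i + 1) % len(self.chi)
        return i
```

With N threads, each cached index is at most N steps stale, which trades some accuracy for a single divider instead of one per core.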


6 Evaluation 
6.1 Experimental Methodology 


We evaluate our solution using a detailed processor 
and memory system simulator based on the Pin dy- 
namic binary instrumentation tool [20]. Our in-house 
instruction-level performance simulator can simulate ap- 
plications compiled for the x86 instruction set architec- 
ture. We simulate the memory system in detail using 
a model loosely based on DRAMsim [36]. Both our 
processor model and the memory model mimic the de- 
sign of a modern high-performance dual-core proces- 


sor loosely based on the Intel Pentium M [11]. The 
size/bandwidth/latency/capacity of different processor 
structures along with the number of cores and other 
structures are parameters to the simulator. The simulator 
faithfully models the bandwidth, latency, and capacity 
of each buffer, bus, and structure in the memory subsys- 
tem (including the caches, memory controller, DRAM 
buses, and DRAM banks). The relevant parameters of 
the modeled baseline processor are shown in Table 1. 
Unless otherwise stated, all evaluations in this section are 
performed on a simulated dual-core system using these 
parameters. For our measurements with the FairMem 
system presented in Section 5, the parameters are set to 
$\alpha = 1.025$ and $\beta = 10^5$.

We simulate each application for 100 million x86 in- 
structions. The portions of applications that are sim- 
ulated are determined using the SimPoint tool [32], 
which selects simulation points in the application that 
are representative of the application’s behavior as a 
whole. Our applications include stream and rdarray (de- 
scribed in Section 3), several large benchmarks from the 
SPEC CPU2000 benchmark suite [34], and one memory- 
intensive benchmark from the Olden suite [31]. These 
applications are described in Table 2. 


6.2 Evaluation Results 
6.2.1 Dual-core Systems 


Two microbenchmark applications - stream and 
rdarray: Figure 7 shows the normalized execution 
time of stream and rdarray applications when run alone 
or together using either the baseline FR-FCFS or our 
FairMem memory scheduling algorithms. Execution 
time of each application is normalized to the execution 
time they experience when they are run alone using the 
FR-FCFS scheduling algorithm (This is true for all nor- 
malized results in this paper). When stream and rdarray 
are run together on the baseline system, stream—which 
acts as an MPH—experiences a slowdown of only 1.22X 
whereas rdarray slows down by 2.45X. In contrast, a 
memory controller that uses our FairMem algorithm pre- 
vents stream from behaving like an MPH against rdarray 
— both applications experience similar slowdowns when 
run together. FairMem does not significantly affect per- 
formance when the applications are run alone or when 
run with identical copies of themselves (i.e. when mem- 
ory performance is not unfairly impacted). These exper- 
iments show that our simulated system closely matches 
the behavior we observe in an existing dual-core system 
(Figure 4), and that FairMem successfully provides fair- 
ness among threads. Next, we show that with real appli- 
cations, the effect of an MPH can be drastic. 

Effect on real applications: Figure 8 shows the normal- 
ized execution time of 8 different pairs of applications 
when run alone or together using either the baseline FR- 
Parameter | Value
Processor pipeline | 4 GHz processor, 128-entry instruction window, 12-stage pipeline
Fetch/Execute width per core | 3 instructions can be fetched/executed every cycle; only 1 can be a memory operation
L1 Caches | 32 K-byte per core, 4-way set associative, 32-byte block size, 2-cycle latency
L2 Caches | 512 K-byte per core, 8-way set associative, 32-byte block size, 12-cycle latency
Memory controller | 128 request buffer entries, FR-FCFS baseline scheduling policy, runs at 2 GHz
DRAM parameters | 8 banks, 2K-byte row-buffer
DRAM latency (round-trip L2 miss latency) | row-buffer hit: 50ns (200 cycles), closed: 75ns (300 cycles), conflict: 100ns (400 cycles)

Table 1: Baseline processor configuration


Benchmark | Suite | Brief description | Base performance | L2 misses per 1K inst. | Row-buffer hit rate
stream | Microbenchmark | Streaming on 32-byte-element arrays | [illegible] | [illegible] | 96%
rdarray | Microbenchmark | Random access on arrays | 56.29 cycles/inst. | 29.18 | 3%
small-stream | Microbenchmark | Streaming on 4-byte-element arrays | 13.86 cycles/inst. | 71.43 | [illegible]
art | SPEC 2000 FP | Object recognition in thermal image | 7.85 cycles/inst. | 70.82 | [illegible]
crafty | SPEC 2000 INT | Chess game | [illegible] | 0.35 | [illegible]
health | Olden | Columbian health care system simulator | 7.24 cycles/inst. | 83.45 | 27%
mcf | SPEC 2000 INT | Single-depot vehicle scheduling | 4.73 cycles/inst. | 45.95 | [illegible]
vpr | SPEC 2000 INT | FPGA circuit placement and routing | 1.71 cycles/inst. | [illegible] | [illegible]

Table 2: Evaluated applications and their performance characteristics on the baseline processor
Figure 7: Slowdown of (a) stream and (b) rdarray benchmarks using FR-FCFS and our FairMem algorithm


FCFS or FairMem. The results show that 1) an MPH can 
severely damage the performance of another application, 
and 2) our FairMem algorithm is effective at preventing 
it. For example, when stream and health are run together 
in the baseline system, stream acts as an MPH slowing 
down health by 8.6X while itself being slowed down by 
only 1.05X. This is because it has 7 times higher L2 miss 
rate and much higher row-buffer locality (96% vs. 27%) 
— therefore, it exploits unfairness in both row-buffer- 
hit first and oldest-first scheduling policies by flooding 
the memory system with its requests. When the two 
applications are run on our FairMem system, health’s 
slowdown is reduced from 8.63X to 2.28X. The figure 
also shows that even regular applications with high row- 
buffer locality can act as MPHs. For instance when art 
and vpr are run together in the baseline system, art acts as 
an MPH slowing down vpr by 2.35X while itself being 
slowed down by only 1.05X. When the two are run on 
our FairMem system, each slows down by only 1.35X; 
thus, art is no longer a performance hog. 

Effect on Throughput and Unfairness: Table 3 shows
the overall throughput (in terms of executed instructions
per 1000 cycles) and DRAM unfairness (relative difference
between the maximum and minimum memory-related
slowdowns, defined as $\Psi$ in Section 4) when different
application combinations are executed together. In
all cases, FairMem reduces the unfairness to below 1.20
(recall that 1.00 is the best possible $\Psi$ value).
Interestingly, in most cases, FairMem also improves overall
throughput significantly. This is especially true when
a very memory-intensive application (e.g., stream) is run
with a much less memory-intensive application (e.g., vpr).
Providing fairness leads to higher overall system
throughput because it enables better utilization of the
cores (i.e., better utilization of the multi-core system).
The baseline FR-FCFS algorithm significantly hinders
the progress of a less memory-intensive application,
whereas FairMem allows this application to stall less
due to the memory system, thereby enabling it to make
fast progress through its instruction stream. Hence,
rather than wasting execution cycles due to unfairly-induced
memory stalls, some cores are better utilized
with FairMem.¹¹ On the other hand, FairMem reduces
the overall throughput by 9% when two extremely
memory-intensive applications, stream and rdarray, are
run concurrently. In this case, enforcing fairness reduces
stream’s data throughput without significantly increasing
rdarray’s throughput because rdarray encounters L2
cache misses as frequently as stream (see Table 2).


¹¹ Note that the data throughput obtained from the DRAM itself may
be, and usually is, reduced using FairMem. However, overall throughput
in terms of instructions executed per cycle usually increases.
Figure 8: Slowdown of different application combinations using FR-FCFS and our FairMem algorithm
Table 3: Effect of FairMem on overall throughput (in terms of instructions per 1000 cycles) and unfairness


6.2.2 Effect of Row-buffer Size 


From the above discussions, it is clear that the exploita- 
tion of row-buffer locality by the DRAM memory con- 
troller makes the multi-core memory system vulnerable 
to DoS attacks. The extent to which this vulnerability can 
be exploited is determined by the size of the row-buffer. 
In this section, we examine the impact of row-buffer size 
on the effectiveness of our algorithm. For these sensitiv- 
ity experiments we use two real applications, art and vpr, 
where art behaves as an MPH against vpr. 

Figure 9 shows the mutual impact of art and vpr on 
machines with different row-buffer sizes. Additional 
statistics are presented in Table 4. As row-buffer size
increases, the extent to which art becomes a memory
performance hog for vpr increases when the FR-FCFS
scheduling algorithm is used. In a system with very small,
512-byte row-buffers, vpr experiences a slowdown of 1.65X
(versus art’s 1.05X). In a system with very large, 64 KB 
row-buffers, vpr experiences a slowdown of 5.50X (ver- 
sus art’s 1.03X). Because art has very high row-buffer 
locality, a large buffer size allows its accesses to occupy 
a bank much longer than a small buffer size does. Hence, 





art’s ability to deny bank service to vpr increases with 
row-buffer size. FairMem effectively contains this denial 
of service and results in similar slowdowns for both art 
and vpr (1.32X to 1.41X). It is commonly assumed that 
row-buffer sizes will increase in the future to allow better 
throughput for streaming applications [41]. As our re- 
sults show, this implies that memory-related DoS attacks 
will become a larger problem and algorithms to prevent
them will become more important.¹²


6.2.3 Effect of Number of Banks 


The number of DRAM banks is another important parameter
that affects how much two threads can interfere
with each other’s memory accesses. Figure 10 shows
the impact of art and vpr on each other on machines
with different numbers of DRAM banks. As the number
of banks increases, the available parallelism in the


¹² Note that reducing the row-buffer size may at first seem like one
way of reducing the impact of memory-related DoS attacks. However, 
this solution is not desirable because reducing the row-buffer size sig- 
nificantly reduces the memory bandwidth (hence performance) for ap- 
plications with good row-buffer locality even when they are running 
alone or when they are not interfering with other applications. 
Figure 9: Normalized execution time of art and vpr when run together on processors with different row-buffer sizes.
Execution time is independently normalized to each machine with different row-buffer size.
Figure 10: Slowdown of art and vpr when run together on processors with various numbers of DRAM banks. Execution
time is independently normalized to each machine with different number of banks.
Table 5: Statistics for art-vpr with different number of DRAM banks (IPTC: Instructions/1000-cycles)


memory system increases, and thus art becomes less of a 
performance hog; its memory requests conflict less with 
vpr’s requests. Regardless of the number of banks, our 
mechanism significantly mitigates the performance im- 
pact of art on vpr while at the same time improving over- 
all throughput as shown in Table 5. Current DRAMs 
usually employ 4-16 banks because a larger number of 
banks increases the cost of the DRAM system. In a sys- 
tem with 4 banks, art slows down vpr by 2.64X (while 
itself being slowed down by only 1.10X). FairMem is 
able to reduce vpr’s slowdown to only 1.62X and im- 
prove overall throughput by 32%. In fact, Table 5 shows 
that FairMem achieves the same throughput on only 4 
banks as the baseline scheduling algorithm on 8 banks. 


6.2.4 Effect of Memory Latency 


Clearly, memory latency also has an impact on the vul- 
nerability in the DRAM system. Figure 11 shows how 
different DRAM latencies influence the mutual perfor- 
mance impact of art and vpr. We vary the round-trip 
latency of a request that hits in the row-buffer from 50 
to 1000 processor clock cycles, and scale closed/conflict 
latencies proportionally. As memory latency increases, 
the impact of art on vpr also increases. Vpr’s slowdown 


is 1.89X with a 50-cycle latency versus 2.57X with a 
1000-cycle latency. Again, FairMem reduces art’s im- 
pact on vpr for all examined memory latencies while 
also improving overall system throughput (Table 6). As 
main DRAM latencies are expected to increase in mod- 
ern processors (in terms of processor clock cycles) [39], 
scheduling algorithms that mitigate the impact of MPHs 
will become more important and effective in the future. 
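The latency sweep can be made concrete with a short Python sketch. It assumes that "proportionally" means keeping the baseline ratio of hit, closed, and conflict latencies from Table 1 (200 : 300 : 400 cycles at 4 GHz); the function names `ns_to_cycles` and `scaled_latencies` are hypothetical.

```python
# Baseline round-trip latencies in processor cycles, from Table 1.
BASE = {"hit": 200, "closed": 300, "conflict": 400}


def ns_to_cycles(ns, ghz=4):
    """Convert nanoseconds to processor cycles at the given clock (GHz)."""
    return ns * ghz


def scaled_latencies(hit_cycles):
    """Scale closed/conflict latencies in proportion to the row-hit latency."""
    factor = hit_cycles / BASE["hit"]
    return {kind: cycles * factor for kind, cycles in BASE.items()}
```

Under this assumption, a 50-cycle row hit corresponds to 75 and 100 cycles for the closed and conflict cases, and a 1000-cycle row hit to 1500 and 2000 cycles.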


6.2.5 Effect of Number of Cores 


Finally, this section analyzes FairMem within the con- 
text of 4-core and 8-core systems. Our results show that 
FairMem effectively mitigates the impact of MPHs while 
improving overall system throughput in both 4-core and 
8-core systems running different application mixes with 
varying memory-intensiveness. 

Figure 12 shows the effect of FairMem on three dif- 
ferent application mixes run on a 4-core system. In 
all the mixes, stream and small-stream act as severe 
MPHs when run on the baseline FR-FCFS system, slow- 
ing down other applications by up to 10.4X (and at least 
3.5X) while themselves being slowed down by no more 
than 1.10X. FairMem reduces the maximum slowdown 
caused by these two hogs to at most 2.98X while also 
Figure 11: Slowdown of art and vpr when run together on processors with different DRAM access latencies. Execution
time is independently normalized to each machine with different number of banks. Row-buffer hit latency is denoted.
Figure 12: Effect of FR-FCFS and FairMem scheduling on different application mixes in a 4-core system


improving the overall throughput of the system (Table 7). 
Figure 13 shows the effect of FairMem on three dif- 
ferent application mixes run on an 8-core system. Again, 
in the baseline system, stream and small-stream act as 
MPHs, sometimes degrading the performance of another 
application by as much as 17.6X. FairMem effectively 
contains the negative performance impact caused by the 
MPHs for all three application mixes. Furthermore, it 
is important to observe that FairMem is also effective 
at isolating non-memory-intensive applications (such as 
crafty in MIX2 and MIX3) from the performance degra- 
dation caused by the MPHs. Even though crafty rarely 
generates a memory request (0.35 times per 1000 instructions),
it is slowed down by 7.85X by the baseline system
when run within MIX2! With FairMem, crafty’s rare
memory requests are not unfairly delayed due to a memory
performance hog, and its slowdown is reduced to
only 2.28X. The same effect is also observed for crafty in
MIX3.¹³ We conclude that FairMem provides fairness in
the memory system, which improves the performance of 
both memory-intensive and non-memory-intensive ap- 
plications that are unfairly delayed by an MPH. 


7 Related Work 


The possibility of exploiting vulnerabilities in the soft- 
ware system to deny memory allocation to other appli- 
cations has been considered in a number of works. For 


¹³ Notice that 8p-MIX2 and 8p-MIX3 are much less memory intensive
than 8p-MIX1. Due to this, their baseline overall throughput is
significantly higher than that of 8p-MIX1, as shown in Table 7.


example, [37] describes an attack in which one process 
continuously allocates virtual memory and causes other 
processes on the same machine to run out of memory 
space because swap space on disk is exhausted. The 
“memory performance attack” we present in this paper 
is conceptually very different from such “memory allo- 
cation attacks” because (1) it exploits vulnerabilities in 
the hardware system, (2) it is not amenable to software 
solutions — the hardware algorithms must be modified 
to mitigate the impact of attacks, and (3) it can be caused 
even unintentionally by well-written, non-malicious but 
memory-intensive applications. 

Only a few research papers consider hardware
security issues in computer architecture. Woo and
Lee [38] describe similar shared-resource attacks that 
were developed concurrently with this work, but they do 
not show that the attacks are effective in real multi-core 
systems. In their work, a malicious thread tries to dis- 
place the data of another thread from the shared caches or 
to saturate the on-chip or off-chip bandwidth. In contrast, 
our attack exploits the unfairness in the DRAM memory 
scheduling algorithms; hence their attacks and ours are 
complementary. 

Grunwald and Ghiasi [12] investigate the possibility of 
microarchitectural denial of service attacks in SMT (si- 
multaneous multithreading) processors. They show that 
SMT processors exhibit a number of vulnerabilities that 
could be exploited by malicious threads. More specif- 
ically, they study a number of DoS attacks that affect 
caching behavior, including one that uses self-modifying 
Figure 13: Effect of FR-FCFS and FairMem scheduling on different application mixes in an 8-core system


Metric | 4p-MIX1 | 4p-MIX2 | 4p-MIX3 | 8p-MIX1 | 8p-MIX2 | 8p-MIX3
base throughput (IPTC) | [illegible] | [illegible] | [illegible] | [illegible] | [illegible] | [illegible]
FairMem throughput (IPTC) | 179 | 338 | 234 | 189 | 1233 | 2809
base unfairness ($\Psi$) | 8.05 | [illegible] | 10.98 | 7.89 | [illegible] | [illegible]
FairMem unfairness ($\Psi$) | 1.09 | 1.32 | 1.21 | 1.18 | 1.34 | 1.32
FairMem throughput improvement | 1.67X | 2.17X | [illegible] | 1.44X | 1.97X | 1.57X
FairMem fairness improvement | 7.39X | 6.60X | 9.07X | 6.69X | 10.11X | 7.66X

Table 7: Throughput and fairness statistics for 4-core and 8-core systems


code to cause the trace cache to be flushed. The authors 
then propose counter-measures that ensure fair pipeline 
utilization. The work of Hasan et al. [13] studies in a sim- 
ulator the possibility of so-called heat stroke attacks that 
repeatedly access a shared resource to create a hot spot at 
the resource, thus slowing down the SMT pipeline. The 
authors propose a solution that selectively slows down 
malicious threads. These two papers present involved 
ways of “hacking” existing systems using sophisticated 
techniques such as self-modifying code or identifying 
on-chip hardware resources that can heat up. In contrast, 
our paper describes a more prevalent problem: a triv- 
ial type of attack that could be easily developed by any- 
one who writes a program. In fact, even existing simple 
applications may behave like memory performance hogs 
and future multi-core systems are bound to become even 
more vulnerable to MPHs. In addition, neither of the
above works considers vulnerabilities in shared DRAM
memory in multi-core architectures. 


The FR-FCFS scheduling algorithm implemented in 
many current single-core and multi-core systems was 
studied in [30, 29, 15, 23], and its best implementation— 
the one we presented in Section 2—is due to Rixner 
et al. [30]. This algorithm was initially developed for
single-threaded applications and shows good through- 
put performance in such scenarios. As shown in [23], 
however, it can have negative effects on fairness in chip- 
multiprocessor systems. The performance impact of dif- 
ferent memory scheduling techniques in SMT processors 
and multiprocessors has been considered in [42, 22]. 


Fairness issues in managing access to shared resources 
have been studied in a variety of contexts. Network fair 
queuing has been studied in order to offer guaranteed ser- 
vice to simultaneous flows over a shared network link, 
e.g., [24, 40, 3], and techniques from network fair queu- 
ing have since been applied in numerous fields, e.g., CPU 
scheduling [6]. The best currently known algorithm for 


network fair scheduling that also effectively solves the 
idleness problem was proposed in [2]. In [23], Nesbit et 
al. propose a fair memory scheduler that uses the def- 
inition of fairness in network queuing and is based on 
techniques from [3, 40]. As we pointed out in Section 4, 
directly mapping the definitions and techniques from net- 
work fair queuing to DRAM memory scheduling is prob- 
lematic. Also, the scheduling algorithm in [23] can sig- 
nificantly suffer from the idleness problem. Fairness in 
disk scheduling has been studied in [4, 26]. The tech- 
niques used to achieve fairness in disk access are highly 
influenced by the physical association of data on the disk 
(cylinders, tracks, sectors, etc.) and therefore cannot be
directly applied to DRAM scheduling.

Shared hardware caches in multi-core systems have 
been studied extensively in recent years, e.g. in [35, 19, 
14, 28, 9]. Suh et al. [35] and Kim et al. [19] develop 
hardware techniques to provide thread-fairness in shared 
caches. Fedorova et al. [9] and Suh et al. [35] propose 
modifications to the operating system scheduler to allow 
each thread its fair share of the cache. These solutions do 
not directly apply to DRAM memory controllers. How- 
ever, the solution we examine in this paper has interac- 
tions with both the operating system scheduler and the 
fairness mechanisms used in shared caches, which we 
intend to examine in future work. 


8 Conclusion 


The advent of multi-core architectures has spurred a lot 
of excitement in recent years. It is widely regarded as the 
most promising direction towards increasing computer 
performance in the current era of power-consumption- 
limited processor design. In this paper, we show that this 
development—besides posing numerous challenges in 
fields like computer architecture, software engineering, 
or operating systems—bears important security risks. 

In particular, we have shown that due to unfairness in 
the memory system of multi-core architectures, some ap- 
Abstract


Reverse engineering of software is the process of recov- 
ering higher-level structure and meaning from a lower- 
level program representation. It can be used for legit- 
imate purposes—e.g., to recover source code that has 
been lost—but it is often used for nefarious purposes, 
e.g., to search for security vulnerabilities in binaries or to 
steal intellectual property. This paper addresses the problem
of making it hard to reverse engineer binary programs
by making it difficult to disassemble machine code
statically. Binaries are obfuscated by changing many 
control transfers into signals (traps) and inserting dummy 
control transfers and “junk” instructions after the signals. 
The resulting code is still a correct program, but even 
the best current disassemblers are unable to disassemble 
40%-60% of the instructions in the program. Further- 
more, the disassemblers have a mistaken understanding 
of over half of the control flow edges. However, the ob- 
fuscated program necessarily executes more slowly than 
the original. Experimental results quantify the degree of 
obfuscation, stealth of the code, and effects on execution 
time and code size. 


1 Introduction 


Software is often distributed in binary form, without 
source code. Many groups have developed technology 
that enables one to reverse engineer binary programs 
and thereby reconstruct the actions and structure of the 
program. This is accomplished by disassembling ma- 
chine code into assembly code and then possibly de- 
compiling the assembly code into higher level repre- 
sentations [5, 6, 13]. While reverse-engineering tech- 
nology has many legitimate uses (in particular, an im- 
portant application of binary-level reverse engineering 
is to analyse malware in order to understand its behav- 
ior [4, 16-18, 25, 27, 33]), it can also be used to dis- 
cover vulnerabilities, make unauthorized modifications, 
or steal intellectual property. 


This work was supported in part by NSF Grants EIA-0080123, 
CCR-0113633, and CNS-0410918. 


Since the first step in reverse engineering a binary 
is disassembly, many approaches to binary obfuscation 
focus on disrupting this step. This is typically done 
by identifying assumptions made by disassemblers, then 
transforming the program systematically so as to violate 
these assumptions without altering program functional- 
ity. Two fundamental assumptions made by disassem- 
blers are that (1) the address where each instruction be- 
gins can be determined; and (2) control transfer instruc- 
tions can be identified and their targets determined. The 
first assumption is used to identify the actual instructions 
to disassemble; most modern disassemblers use the sec- 
ond assumption to determine which memory regions get 
disassembled (see Section 2). In this context, this paper 
makes the following contributions: 


1. It shows how the second of these assumptions can 
be violated, such that actual control transfers in the 
program cannot be identified by a static disassem- 
bler. This is done by replacing control transfer 
instructions—jumps, calls, and returns—by “ordi- 
nary” instructions whose execution raises traps at 
runtime; these traps are then fielded by signal han- 
dling code that carries out the appropriate control 
transfer. The effect is to replace control transfer 
instructions either with apparently innocuous arith- 
metic or memory operations or with what appear to 
be illegal instructions that suggest an erroneous dis- 
assembly. 


2. It shows how the code resulting from this first trans- 
formation can be further obfuscated to additionally 
violate the first assumption stated above. This is 
done using a secondary transformation that inserts 
(unreachable) code, containing fake control trans- 
fers, after these trap-raising instructions, in order to 
make it hard to find the beginning of the true next 
instructions. 


In earlier work, we showed how disassembly could be 
disrupted by violating the first assumption [20]; this pa- 
per extends that work by showing a different way to ob- 
fuscate binaries by replacing control transfer instructions 
with apparently innocuous non-control-transfer instructions. 
It is also very different from our earlier work on 
intrusion detection [21], which proposed a way to hinder 
certain kinds of mimicry attacks by obfuscating system 
call instructions. That work sought simply to disguise the 
instruction (‘int $0x80’ in Intel x86 processors) used by 
applications to trap into the OS kernel; more importantly, 
it required kernel modifications in order to work. By 
contrast, the work described in this paper applies to arbi- 
trary control transfers in programs and requires no kernel 
modifications. Taken together, these two differences lead 
to significant differences between the two approaches in 
terms of goals, techniques, and effects. 


It is important to note that code obfuscation is merely 
a technique: just as it can be used to protect software 
against attackers, so too it can be used to hide malicious 
content. The work presented here can therefore be seen 
from two perspectives: as a “defense model” of a new 
approach for protecting intellectual property, or as an 
“attack model” of a new approach for hiding malicious 
content. In either case, it goes well beyond current ap- 
proaches to hiding the content of executable code. In 
particular, the obfuscations cause the best existing disassemblers 
to miss 40%-60% of the instructions in test 
programs and to make mistakes on over half of the con- 
trol flow edges. 


The remainder of the paper is organized as follows. 
Section 2 provides background information on static dis- 
assembly algorithms. Section 3 describes the new tech- 
niques for thwarting disassembly and explains how they 
are implemented. Section 4 describes how we evaluate 
the efficacy of our approach. Section 5 gives experimen- 
tal results for programs in the SPECint-2000 benchmark 
suite. Section 6 describes related work, and Section 7 
contains concluding remarks. 


2 Disassembly Algorithms 


This section summarizes the operation of disassemblers 
in order to provide the context needed to understand 
how our obfuscation techniques work. Broadly speak- 
ing, there are two approaches to disassembly: static and 
dynamic, the difference between them being that the for- 
mer examines the program without execution, while the 
latter monitors the program’s execution (e.g., through a 
debugger) as part of the disassembly process. Static dis- 
assembly processes the entire input program all at once, 
while dynamic disassembly only disassembles those in- 
structions that were executed for the particular input that 
was used. Moreover, with static disassembly it is eas- 
ier to apply offline program analyses to reason about 
semantic aspects of the program under consideration. 
Finally, programs being disassembled statically are not 
able to defend themselves against reverse engineering using 
anti-debugging techniques (see, for example, [2, 3]). 
For these reasons, static disassembly is a popular choice 
for low level reverse engineering. This paper focuses on 
static disassembly: its goal is to render static disassem- 
bly of programs sufficiently difficult and expensive as to 
force attackers to resort to dynamic approaches (which, 
in principle, can then be defended against). 


There are two generally used techniques for static dis- 
assembly: linear sweep and recursive traversal [26]. The 
linear sweep algorithm begins disassembly at the input 
program’s first executable location, and simply sweeps 
through the entire text section disassembling each in- 
struction as it is encountered. This method is used by 
programs such as the GNU utility objdump [24] as well 
as a number of link-time optimization tools [8, 23, 29]. 
The main weakness of linear sweep is that it is prone to 
disassembly errors resulting from the misinterpretation 
of data, such as jump tables, embedded in the instruction 
stream. 
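
The weakness of linear sweep can be illustrated with a minimal sketch. The instruction set below is a hypothetical toy encoding (not x86), and the names TOY_ISA, linear_sweep, CODE, and TRUE_STARTS are assumptions made only for this example. A jump skips over one embedded data byte, which the sweep decodes as if it were an instruction.

```python
# Hypothetical toy ISA (not x86): opcode byte -> (mnemonic, total length in bytes).
TOY_ISA = {
    0x01: ("nop", 1),
    0x03: ("jmp", 2),  # jmp rel8: target = address of next instruction + rel8
    0x04: ("add", 3),
    0x05: ("ret", 1),
}


def linear_sweep(code, start=0):
    """Decode instructions back to back, ignoring control flow entirely."""
    listing = []
    pc = start
    while pc < len(code):
        # Unknown opcodes are reported as "(bad)" and consume one byte.
        mnemonic, length = TOY_ISA.get(code[pc], ("(bad)", 1))
        listing.append((pc, mnemonic))
        pc += length
    return listing


# A jmp over one data byte (0x04), followed by the real instructions nop and ret.
CODE = bytes([0x03, 0x01, 0x04, 0x01, 0x05])
TRUE_STARTS = {0, 3, 4}

if __name__ == "__main__":
    for address, mnemonic in linear_sweep(CODE):
        print(address, mnemonic)
```

In this sketch the data byte at offset 2 happens to look like the opcode of a three-byte instruction, so the sweep swallows the two real instructions that follow it.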


The recursive traversal algorithm uses the control flow 
instructions of the program being disassembled in or- 
der to determine what to disassemble. It starts with 
the program’s entry point, and disassembles the first ba- 
sic block. When the algorithm encounters a control 
flow instruction, it determines the possible successors of 
that instruction—i.e., addresses where execution could 
continue—and proceeds with disassembly at those ad- 
dresses. Variations on this basic approach to disassem- 
bly are used by a number of binary translation and opti- 
mization systems [6, 28, 30]. The main virtue of recur- 
sive traversal is that by following the control flow of a 
program, it is able to “go around” and thus avoid disas- 
sembly of data embedded in the text section. Its main 
weakness is that it depends on being able to determine 
the possible successors of each such instruction, which 
is difficult for indirect jumps and calls. The algorithm 
also depends on being able to find all the instructions 
that affect control flow. 
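
A minimal sketch of recursive traversal follows, again over a hypothetical toy encoding rather than x86. The names TOY_ISA, successors, and recursive_traversal are assumptions for illustration, and jump displacements are treated as unsigned forward offsets to keep the example short.

```python
# Hypothetical toy ISA (not x86): opcode byte -> (mnemonic, total length in bytes).
TOY_ISA = {
    0x01: ("nop", 1),
    0x03: ("jmp", 2),  # unconditional: target = next instruction + rel8
    0x04: ("add", 3),
    0x05: ("ret", 1),
    0x06: ("jz", 2),   # conditional: target or fall-through
}


def successors(code, pc, mnemonic, length):
    """Addresses where execution could continue after the instruction at pc."""
    fallthrough = pc + length
    if mnemonic == "ret":
        return []
    if mnemonic in ("jmp", "jz"):
        target = fallthrough + code[pc + 1]
        return [target] if mnemonic == "jmp" else [target, fallthrough]
    return [fallthrough]


def recursive_traversal(code, entry=0):
    """Disassemble only the addresses reachable by following control flow."""
    found = {}
    worklist = [entry]
    while worklist:
        pc = worklist.pop()
        if pc in found or not 0 <= pc < len(code):
            continue
        if code[pc] not in TOY_ISA:
            continue  # undecodable byte: stop following this path
        mnemonic, length = TOY_ISA[code[pc]]
        if pc + length > len(code):
            continue  # truncated instruction at the end of the buffer
        found[pc] = mnemonic
        worklist.extend(successors(code, pc, mnemonic, length))
    return dict(sorted(found.items()))
```

On the byte sequence 03 01 04 01 05 the traversal follows the jump and never decodes the data byte at offset 2. It also shows the dependence noted above: any control transfer the traversal cannot recognize simply ends the path.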


A recently proposed generalization of recursive traver- 
sal is that of exhaustive disassembly [14,15], which is the 
most sophisticated disassembly algorithm we are aware 
of. This approach aims to work around certain kinds 
of binary obfuscations by considering all possible disas- 
semblies of each function. It examines the control trans- 
fer instructions in these alternative disassemblies to iden- 
tify basic block boundaries, then uses a variety of heuris- 
tic and statistical reasoning to rule out alternatives that 
are unlikely or impossible. Like the recursive traversal 
algorithm it generalizes, the exhaustive algorithm thus 
also relies fundamentally on identifying and analyzing 
the behavior of control transfer instructions. 


3 Signal-Based Obfuscation 


3.1 Overview 


In order to confuse a disassembler, we have to disrupt its 
notion of where the instructions are, what they are doing, 
and what the control flow is. The choices we have for al- 
tering the program are (1) changing instructions to others 
that produce the same result, and (2) adding instructions 
that do not have visible effects. Simple, local changes 
will obviously not confuse a disassembler or a human. 
More global and drastic changes are required. 


The essential intuition of our approach can be illus- 
trated via a simple example, given in Figure 1. The origi- 
nal code fragment on the left-hand side of the figure con- 
tains an unconditional jump to a location L; the jump 
is preceded by Code-before and followed by Code-after. 
This code is obfuscated by replacing the jump by code 
that attempts to access an illegal memory location ℓ and 
thereby generates a trap, which raises a signal. This is 
fielded by a handler that uses the address of the instruc- 
tion that caused the trap to determine the target address 
L of the original jump instruction and to cause control to 
branch to L. In addition, Bogus Code is inserted after the 
trap point; this code appears to be reachable, but in fact it 
is not. Judicious choice of bogus code can throw off the 
disassembly even further. 


This example illustrates a number of key aspects of 
our approach that increase the difficulty of statically 
deobfuscating programs: 


— A variety of different instructions and addresses can 
be used to raise a signal at runtime. The example 
uses a load from an illegal address, but we could 
have used many other alternatives, e.g., a store to 
a write-protected location, or a load from a read-protected 
location. Indeed, on an architecture such 
as the Intel x86, any instruction that can take a memory 
operand, including all the familiar arithmetic instructions, 
can be used for this purpose. Moreover, 
the address ℓ used to generate the trap can be a legal 
address, albeit one that does not (at runtime) permit 
a particular kind of memory access. We can further 
hamper static reverse engineering by using something 
like an mprotect() system call to (possibly 
temporarily) change the protection of the address ℓ 
being used to generate the trap at runtime, so that it 
is not statically obvious that attempting a particular 
kind of memory access at address ℓ will raise a trap. 


— The address ℓ used to generate the trap need not be 
a determinate value. For example, suppose that, as 
in typical 32-bit Linux systems, the top 1 GB of the 
virtual address space (i.e., addresses 0xc0000000 
to 0xffffffff) is reserved for the kernel, and is 
inaccessible to user processes. Then, any value of ℓ 
in that address range will serve to generate the desired 
trap. Such values can be computed by starting 
with an arbitrary value and then using bit manipulations 
to obtain a value in the appropriate range, as 
shown below (one can imagine many variations on 
this theme), where A and B are arbitrary legal memory 
locations: 


r0 := contents of A 
r1 := contents of B 
r1 := r1 | 0xc0    /* r1's low byte >= 0xc0 */ 
r1 := r1 << 24     /* r1 >= 0xc0000000 */ 
r0 := r0 | r1 


The actual runtime contents of memory locations 
A and B are unimportant here: the value computed 
into r0, which may be different on different executions 
of this code, will nevertheless always point 
into protected kernel address space, causing memory 
accesses through r0 to generate a trap. Such 
indeterminacy can further complicate the task of reverse 
engineering the obfuscated code. Note that 
such indeterminate address computations can also 
be applied to generate an arbitrary address within a 
page (or pages) protected using mprotect() as 
discussed above. 


— A variety of different traps can be used. For exam- 
ple, in addition to the memory access traps men- 
tioned above, we can use arithmetic exceptions, 
e.g., divide-by-zero. In fact, the “instruction” gen- 
erating a trap need not be a legal instruction at all— 
i.e., we can use a byte pattern that does not cor- 
respond to any legal instruction to effect a control 
transfer via an illegal instruction trap. Such ille- 
gal byte sequences—which in general are indistin- 
guishable from data legitimately embedded in the 
instruction stream—can be very effective in confus- 
ing disassemblers. 


— The location following the trap-generating instruc- 
tion is unreachable, but this is not evident from 
standard control flow analyses. We can exploit this 
by inserting additional “bogus” code after the trap- 
generating instruction to further confuse disassem- 
bly. Section 3.2 discusses this in more detail. 
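
The indeterminate address computation described in the second item above can be checked with a short sketch. It models 32-bit registers with Python integers. The constant 0xC0 and the shift by 24 are assumptions chosen so that the result lands in the top 1 GB of a 32-bit address space, and the function name trap_address is hypothetical.

```python
import random

MASK32 = 0xFFFFFFFF        # model 32-bit registers
KERNEL_BASE = 0xC0000000   # start of the top 1 GB of a 32-bit address space


def trap_address(contents_of_a, contents_of_b):
    """Combine two arbitrary words into an address that is always >= KERNEL_BASE."""
    r0 = contents_of_a & MASK32
    r1 = contents_of_b & MASK32
    r1 |= 0xC0                  # assumed constant: low byte of r1 is now >= 0xC0
    r1 = (r1 << 24) & MASK32    # the low byte becomes the high byte
    r0 |= r1                    # the top two bits of r0 are now set
    return r0


if __name__ == "__main__":
    rng = random.Random(2007)
    for _ in range(5):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        print(hex(trap_address(a, b)))
```

Whatever the inputs, the top two bits of the result are set, so the value differs between runs yet always falls in the protected range.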


We could conceivably obfuscate every jump, call, or 
return in the source code. However, this would cause the 
program to execute much more slowly because of signal- 
processing overhead. We allow the user to specify a hot- 
code threshold, and we only obfuscate control transfers 
that are not in hot parts of the original program (see Sec- 
tion 5 for details). Even so, we are able to obfuscate 
about a third of the instructions in hot code blocks. 
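
As a rough sketch of how a hot-code threshold might partition control transfers, the following assumes one particular definition: the most frequently executed transfers that together account for a given fraction of the profiled execution count are hot, and the rest are cold. This definition, the function name cold_transfers, and the profile format are all assumptions for illustration; the actual criterion is the one described in Section 5.

```python
def cold_transfers(profile, threshold):
    """Return the addresses considered cold under an assumed cumulative-count rule.

    profile   -- dict mapping control-transfer address to its execution count
    threshold -- fraction (0.0 to 1.0) of total executions to classify as hot
    """
    total = sum(profile.values())
    hot = set()
    covered = 0
    # Visit transfers from most to least frequently executed.
    for address, count in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
        if total == 0 or covered >= threshold * total:
            break
        hot.add(address)
        covered += count
    return sorted(a for a in profile if a not in hot)


if __name__ == "__main__":
    example = {0x10: 900, 0x20: 90, 0x30: 10, 0x40: 0}
    print([hex(a) for a in cold_transfers(example, 0.9)])
```

Only the cold addresses returned here would be candidates for conversion to traps, which keeps signal-processing overhead away from the frequently executed paths.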


Original code:          Obfuscated code: 

  Code-before             Code-before 
  jmp L                   compute a value ℓ      /* ℓ is an illegal address */ 
  Code-after              load from address ℓ    /* trap: segmentation fault */ 
                          Bogus Code             /* unreachable */ 
                          Code-after 


Figure 1: A Simple Example of our Approach to Obfuscation 


Before obfuscating a program, we first instrument the 
program to gather edge profiles, and then we run the 
instrumented version on a training input. The obfusca- 
tion process itself has several steps. First, using the pro- 
file data and hot-cold threshold, determine which con- 
trol transfers should be obfuscated and modify each such 
instruction as shown in Figure 1. Second, insert bogus 
code at unreachable code locations such as after trap- 
generating instructions. Third, intersperse signal han- 
dling and restore (return from signal) code with the orig- 
inal program code. Fourth, compute the new memory 
layout, construct a table of mappings from trap instruc- 
tions to target addresses, and patch the restore code to use 
this table via a perfect hash function. Finally, assemble a 
new, obfuscated binary. 


3.2 Program Obfuscations 


Within our obfuscator, the original program is repre- 
sented as an interprocedural control-flow graph (ICFG). 
The nodes are basic blocks of machine instructions; the 
edges represent the control flow in the program. 


Obfuscating Control Transfers 


After some initialization actions, our obfuscator makes 
one pass through the original program to flip conditional 
branches—i.e., reverse the sense of the branch condi- 
tion and insert an explicit unconditional jump after it to 
maintain the program’s semantics. This transformation 
has the effect of increasing the set of candidate locations 
where our obfuscation can be applied. Our obfuscator 
then makes a second pass through the program to find 
and modify all control transfer instructions that are to be 
obfuscated. 
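
The branch-flipping pass can be sketched on a symbolic instruction list. The tuple representation, the INVERSE table, and the generated label names are assumptions for this example; the obfuscator itself works on machine code in the ICFG.

```python
# Inverse condition for a few x86-style conditional jumps (illustrative subset).
INVERSE = {"je": "jne", "jne": "je", "jl": "jge", "jge": "jl", "jg": "jle", "jle": "jg"}


def flip_conditional_branches(instructions):
    """Rewrite each 'jcc L' as 'jncc skip; jmp L; skip:'.

    instructions -- list of (mnemonic, operand) tuples
    The rewritten code behaves the same, but now contains an explicit
    unconditional jump that is a candidate for obfuscation.
    """
    out = []
    fresh = 0
    for mnemonic, operand in instructions:
        if mnemonic in INVERSE:
            skip = f".skip{fresh}"
            fresh += 1
            out.append((INVERSE[mnemonic], skip))  # reversed sense of the branch
            out.append(("jmp", operand))           # new unconditional jump
            out.append(("label", skip))            # original fall-through path
        else:
            out.append((mnemonic, operand))
    return out


if __name__ == "__main__":
    for ins in flip_conditional_branches([("cmp", "eax, 0"), ("je", "L1"), ("inc", "eax")]):
        print(ins)
```

Each conditional branch therefore contributes one additional unconditional jump to the set of candidate locations.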


To obfuscate a control transfer instruction, we insert 
Setup code that prepares for raising a signal and then 
Trap code that causes a signal. The Setup code (1) al- 
locates space on the stack for use by the signal handler 
to store the address of the trap instruction, and (2) sets 
a flag that indicates to the signal handler that the coming 
signal is from obfuscated code, not the original program 
itself. To set a flag, we use a pre-allocated array 
(initialized to zero), and the Setup code moves a random 
non-zero value into a randomly chosen element of the ar- 
ray. Jump, return, and call instructions are obfuscated in 
nearly identical ways; the only essential difference is the 
amount of stack space that we need to allocate in order to 
effect the intended control transfer. The Trap code gen- 
erates a trap, which in turn raises a signal. In order not 
to interfere with signal handlers that might be installed 
in the original program, we only raise signals for which 
the default action is to dump core and terminate the pro- 
gram. In particular we use illegal instruction (SIGILL), 
floating point exception (SIGFPE), and segmentation vi- 
olation (SIGSEGV). 


To determine which kind of trap to raise—and to avoid 
the need to save and later restore program registers—we 
first do liveness analysis to determine which registers are 
live at the trap point and which are available for us to 
use. If no register is available, we randomly generate an 
illegal instruction from among several possible choices. 
Otherwise, we generate code to load a zero into a free 
register r, then either dereference r (to cause a segmen- 
tation fault), or divide by r (to cause a floating point ex- 
ception). Since indirect loads are far more frequent than 
divides in real programs, most of the time we choose the 
former. 
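
The selection logic in this paragraph can be sketched as follows. The function name choose_trap and the load_bias parameter (the probability of picking a load over a divide) are assumptions; the text says only that loads are chosen most of the time.

```python
import random


def choose_trap(free_registers, rng, load_bias=0.9):
    """Pick a trap kind given the registers that liveness analysis found free.

    Returns (signal_name, register). With no free register the only option
    is an illegal instruction; otherwise a zeroed register is dereferenced
    (SIGSEGV) or used as a divisor (SIGFPE).
    """
    if not free_registers:
        return ("SIGILL", None)
    register = rng.choice(sorted(free_registers))
    if rng.random() < load_bias:
        return ("SIGSEGV", register)
    return ("SIGFPE", register)


if __name__ == "__main__":
    rng = random.Random(0)
    print(choose_trap(set(), rng))
    print(choose_trap({"ecx", "edx"}, rng))
```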


If we simply moved a zero into a register each time 
that we wanted to trigger a floating point exception or 
segmentation fault, there would be dozens of such in- 
structions that would be a signature for our obfuscation. 
To avoid this, we generate a sequence of instructions by 
using multiple, randomly chosen rewriting rules that per- 
form value-preserving transformations on the registers 
that are free at each obfuscation point. Appendix A de- 
scribes how we randomize the computation of values. 


Inserting Bogus Code 


After obfuscating a control transfer instruction, we next 
insert bogus code—a conditional branch and some junk 
bytes—to further confuse disassemblers. This is shown 
in Figure 2. Since the trap instruction has the effect of 


(a) Original code:      (b) Obfuscated code: 

  Code-before             Code-before 
  jmp Addr                Setup code 
  Code-after              trap instruction 
                          conditional branch     /* unreachable */ 
                          junk bytes             /* unreachable */ 
                          Code-after 


Figure 2: Bogus code insertion 


an unconditional control transfer, the conditional branch 
immediately following the trap is “bogus code” that will 
not be reachable in the obfuscated program, and hence 
it will not be executed. The purpose of adding this in- 
struction is to confuse the control flow analysis of the 
program by misleading the disassembler into identify- 
ing a spurious edge in the control flow graph; the con- 
trol flow edge so introduced can also lead to further dis- 
assembly errors at the target of this control transfer. A 
secondary benefit of such bogus conditional branches is 
that they help improve the stealthiness of the obfuscation, 
since otherwise the disassembly would produce what ap- 
peared to be long sequences of straight-line code without 
any branches, which would not resemble code commonly 
encountered in practice. We randomly select a conditional 
branch, based on how frequently the different 
kinds occur in normal programs, and use a random 
PC-relative displacement. 


The junk bytes are a proper prefix of a legal instruc- 
tion. The goal is to cause a disassembler to consume the 
first few bytes of Code-after when it completes the in- 
struction that starts with the junk bytes. This will ideally 
cause it to continue to misidentify the true instruction 
boundaries for at least a while.¹ We determine the prefix 
length n that maximizes the disassembly error for subsequent 
instructions (n depends only on the instructions in 
Code-after), and insert the first n bytes of an instruction 
chosen randomly from a number of different alternatives. 
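
The choice of prefix length can be sketched with a toy model. The encoding below is hypothetical (the first byte of each instruction fixes its length), and the names TOY_LENGTHS, sweep_starts, missed_starts, and best_prefix_length are assumptions for this example.

```python
# Hypothetical toy ISA: opcode byte -> total instruction length in bytes.
TOY_LENGTHS = {0x01: 1, 0x02: 2, 0x04: 3, 0x05: 1}


def sweep_starts(code):
    """Addresses a linear sweep treats as instruction starts."""
    starts = set()
    pc = 0
    while pc < len(code):
        starts.add(pc)
        pc += TOY_LENGTHS.get(code[pc], 1)  # unknown bytes consume one byte
    return starts


def missed_starts(junk, code_after, true_starts):
    """Count true instruction starts in code_after that the sweep fails to find."""
    perceived = {pc - len(junk) for pc in sweep_starts(junk + code_after)}
    return len(set(true_starts) - perceived)


def best_prefix_length(instruction, code_after, true_starts):
    """Proper-prefix length n of `instruction` that maximizes the missed starts."""
    return max(range(1, len(instruction)),
               key=lambda n: missed_starts(instruction[:n], code_after, true_starts))


CODE_AFTER = bytes([0x01, 0x02, 0x04, 0x05])  # one-, two-, and one-byte instructions
TRUE_STARTS = {0, 1, 3}
THREE_BYTE = bytes([0x04, 0x00, 0x00])        # a legal three-byte instruction

if __name__ == "__main__":
    print(best_prefix_length(THREE_BYTE, CODE_AFTER, TRUE_STARTS))
```

In this toy case a one-byte prefix hides all three following instructions, while a two-byte prefix lets the sweep resynchronize after the first one.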


Building the Mapping Table 


After obfuscating control flow and inserting bogus code, 
our obfuscator computes a memory layout for the obfus- 
cated program and determines final memory addresses. 


¹ This technique only works on variable-instruction-length architectures 
such as the IA-32. Moreover, disassemblers tend to resynchronize 
relatively quickly, so that on average they are confused for only three or 
four instructions before again finding the true instruction boundaries. 


Among these are the addresses of all the trap instruc- 
tions that have been inserted. The obfuscator then goes 
through the control flow graph and gathers the informa- 
tion it needs to build a table that maps trap locations to 
original targets. 


Suppose that N control transfer instructions have been 
obfuscated. Then there are N rows in the mapping table, 
one for each trap point. Each row contains a flag that in- 
dicates the type of transfer that was replaced, and zero, 
one, or two target addresses, depending on the value of 
the flag. To make it hard to reverse engineer the contents 
and use of this table, we use two techniques. First, we 
generate a perfect hash function that maps the N trap addresses 
to distinct integers from 0 to N - 1 [12], and we 
use this function to get indices into the mapping table; 
this machine code is quite inscrutable and hence hard to 
reverse engineer. Second, to make it hard to discover the 
target addresses in the mapping table, in place of each 
target address T we store a value XT that is the XOR of 
T and the corresponding trap address S. 
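
A minimal sketch of the XOR-encoded table follows. It is simplified in two ways that should be read as assumptions: each row holds a single target (no type flag or second target), and a sorted-order dictionary stands in for the generated perfect hash function. The names build_table and lookup_target are hypothetical.

```python
def build_table(transfers):
    """Build the index function and the XOR-encoded table.

    transfers -- dict mapping trap address S to original target address T
    Returns (index_of, table), where table[index_of[S]] == T ^ S.
    """
    # Stand-in for a perfect hash: distinct indices 0 .. N-1 for the N trap addresses.
    index_of = {s: i for i, s in enumerate(sorted(transfers))}
    table = [0] * len(transfers)
    for s, t in transfers.items():
        table[index_of[s]] = t ^ s  # the stored value hides T
    return index_of, table


def lookup_target(index_of, table, trap_address):
    """Recover T from the stored value by XOR-ing with the trap address again."""
    return table[index_of[trap_address]] ^ trap_address


if __name__ == "__main__":
    idx, tbl = build_table({0x8048100: 0x8048200, 0x8048150: 0x8048400})
    print([hex(v) for v in tbl])
    print(hex(lookup_target(idx, tbl, 0x8048150)))
```

Because XOR is its own inverse, the restore code needs only the trap address to decode the entry, while a static reader of the table sees no valid code addresses.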


3.3 Signal Handling 


When an instruction raises a signal, the processor stores 
its address S on the stack, then traps into the kernel. Fig- 
ure 3(a) shows the components and control transfers that 
normally occur when a program raises a signal at address 
S and has installed a signal handler that returns back 
to the program at the same address. (If no handler has 
been installed, the kernel takes the default action for the 
signal.) Figure 3(b) shows the components and control 
transfers that occur in our implementation. The essential 
differences are that we return control to a different target 
address T, and we do so by causing the kernel to transfer 
control to our restore code rather than back to the trap 
address. We allow obfuscated programs to install their 
own signal handlers, as described below. 


(a) Normal signal handling: 

  S: Trap instruction -> Kernel trap handler -> User’s signal handler 
     -> Kernel restore function -> S 

(b) Our signal handling path: 

  S: Trap instruction -> Kernel trap handler -> Our signal handler 
     -> Kernel restore function -> Our restore function 
     -> T: Target instruction 


Figure 3: Signal Handling: Normal and Obfuscated Cases 


Handler and Restore Code Actions 


We trigger the path shown in Figure 3(b) when a signal 
is raised from a trap location that we inserted in the bi- 
nary. However, other instructions in the original program 
might raise the illegal instruction, floating point excep- 
tion, or segmentation fault signals. To tell the difference, 
we use a global array that is initialized to zero. In the 
Setup code before each of the traps we insert in the pro- 
gram, we set a random element of this array to a non-zero 
value. In our signal handler, we loop through this array to 
see if any value is nonzero (and we then reset it to zero). 
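
The flag array protocol can be sketched as a small class. The array size of 16 and the names ObfuscationFlags, setup, and consume are assumptions for illustration.

```python
import random


class ObfuscationFlags:
    """Global flag array: all zero unless Setup code ran just before a trap."""

    def __init__(self, size=16):
        self.flags = [0] * size

    def setup(self, rng):
        """Setup code: store a random non-zero value in a random element."""
        self.flags[rng.randrange(len(self.flags))] = rng.randrange(1, 256)

    def consume(self):
        """Handler side: report whether any flag is set, then reset the array."""
        raised_by_obfuscation = any(self.flags)
        self.flags = [0] * len(self.flags)
        return raised_by_obfuscation


if __name__ == "__main__":
    flags = ObfuscationFlags()
    print(flags.consume())            # no Setup code ran
    flags.setup(random.Random(1))
    print(flags.consume())            # Setup code ran before the trap
```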


In the normal case where our signal handler is pro- 
cessing one of the traps we inserted in the program, it 
overwrites the kernel restore function’s return address 
with the address of our restoration code. That code (1) 
invokes the perfect hash function on the trap address S 
(which was put in the stack space allocated by our signal 
handler), (2) looks up the original target address, (3) re- 
sets the stack frame as appropriate for the type of control 
transfer and (4) transfers control (via a return instruction) 
to the original target address. 


To make it harder for an attacker to find and reverse 
engineer the signal handler, we disperse our handler and 
restore code over the program, i.e., we split the code into 
multiple basic blocks and interleave these in with the 
original program code. We also make multiple slightly 
different copies of each code block so that we are not 
always using the same locations each time we handle a 
signal. As will be shown in Section 5, we are able to 
obfuscate many hot instructions as a side effect of ob- 
fuscating cold code. These include some of the code we 
introduce to handle signals. 


Interaction With Other Signals 


We allow the original program to install signal handlers 
and dynamically to change signal handling semantics. 
By analyzing the binary, we determine whether it in- 


stalls signal handlers: this is done by checking to see 
whether there are any calls to system library routines 
(e.g., signal()) that install signal handlers. We transform 
the code to intercept these calls at runtime and 
record, in a table, the signals that are being handled and 
the address of the corresponding signal handler. When 
our signal handler determines that a signal did not get 
raised by one of our obfuscations (by examining the ar- 
ray of flags), it consults this table. If the user installed 
a handler, we call that handler then return to the original 
program. Otherwise, we take the default action for that 
kind of signal. 
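
The dispatch decision described above can be modeled in a few lines. This is a pure simulation, not real signal handling: the class SignalRouter and its methods are hypothetical names, and the returned strings merely label which path would be taken.

```python
import signal


class SignalRouter:
    """Model of the table of user handlers recorded by intercepted install calls."""

    def __init__(self):
        self.user_handlers = {}

    def intercept_install(self, signum, handler):
        """Stand-in for an intercepted call that installs a signal handler."""
        self.user_handlers[signum] = handler

    def route(self, signum, raised_by_obfuscation):
        """Decide which path a signal takes."""
        if raised_by_obfuscation:
            return "restore"            # our restore code performs the control transfer
        handler = self.user_handlers.get(signum)
        if handler is None:
            return "default"            # no user handler: default action for the signal
        handler(signum)                 # call the user's handler, then resume the program
        return "user"


if __name__ == "__main__":
    router = SignalRouter()
    router.intercept_install(signal.SIGINT, lambda s: print("user handler for", s))
    print(router.route(signal.SIGSEGV, True))
    print(router.route(signal.SIGINT, False))
    print(router.route(signal.SIGFPE, False))
```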


Although in general we are able to handle interactions 
between signal handling in our code and the original pro- 
gram, we discovered one instance of a race condition. In 
particular, one of the SPECint-95 benchmark programs, 
m88ksim, installs a handler for SIGINT, the interrupt signal. 
If we obfuscate that program, run the code, and interrupt 
the program while it happens to be in our handler, 
the program will cause a segmentation fault and 
crash. To solve this type of problem, our signal han- 
dler needs to delay the processing of other signals that 
might be raised. (On Unix this can be done by having 
the signal handler call the sigprocmask() function, or 
by using sigaction() when we (re)install the handler.) 
Once our trap processing code gets back to the restore 
code block of the obfuscated program, it can safely be 
interrupted because it has finished manipulating kernel addresses. 
However, our current implementation does not 
yet block other signals. 


An even worse problem would occur in a multi- 
threaded program, because multiple traps could occur 
and have to be handled at the same time. Signal han- 
dling is not thread safe in general in Unix systems, so 
our obfuscation method cannot be used in an arbitrary 
multithreaded program. However, this is a limitation of 
Unix, not our method. 


3.4 Attack Scenarios 


Recall that our goal is to make static disassembly diffi- 
cult enough to force any adversary to resort to dynamic 
techniques. Here we discuss why we believe our scheme 
is able to attain this goal. 


We assume that our approach is known to the adver- 
sary. As discussed in Section 3.2, the specifics of the ob- 
fuscation as applied to a particular program—the setup 
code, the kind of trap used for any particular control 
transfer, the code sequence used to generate traps, as 
well as the bogus control transfers inserted after the trap 
instruction—are chosen randomly. This makes it difficult 
for an adversary to identify the location of trap instruc- 
tions and bogus control transfers simply by inspecting 
the obfuscated code. 


Since locating the obfuscation code by simple inspec- 
tion is not feasible, the only other possibility to consider, 
for statically identifying the obfuscation instructions, is 
static analysis. This is difficult for two reasons. The first 
is the sheer number of candidates: for example, in prin- 
ciple any memory operation can raise an exception and is 
therefore potentially a candidate for analysis. Secondly, 
the problem of statically determining the values of the 
operands of such candidate instructions is difficult, both 
theoretically [22] and in practice, especially because, as 
discussed in Section 3.1, such operands need not be fixed 
constant values. Furthermore, if a byte sequence is en- 
countered in the disassembly that does not encode a legal 
instruction (and therefore cannot be subjected to static 
analysis), it can be either a part of the obfuscation (i.e., 
is “executed” and causes a trap), or it can be data embed- 
ded in the instruction stream: determining which of these 
is the case is in general an undecidable problem. 


4 Evaluation 


We measure the efficacy of obfuscation in two ways: by 
the extent of incorrect disassembly of the input, and by 
the extent of errors in control flow analysis of the dis- 
assembled input. These quantities are related, in the 
sense that an incorrect disassembly of a control trans- 
fer instruction will result in a corresponding error in the 
control flow graph obtained for the program. However, 
it is possible, in principle, to have a perfect disassem- 
bly and yet have errors in control flow analysis because 
control transfer instructions have been disguised as in- 
nocuous arithmetic instructions or bogus control trans- 
fers have been inserted. 


4.1 Evaluating Disassembly Errors 


We measure the extent of disassembly errors using a 
measure we call the confusion factor for the instructions, 
basic blocks, and functions. Intuitively, the confusion 
factor measures the fraction of program units (instruc- 
tions, basic blocks, or functions) in the obfuscated code 
that were incorrectly identified by a disassembler. More 
formally, let A be the set of all actual instruction ad- 
dresses, i.e., those that would be encountered when the 
program is executed, and let P be the set of all perceived 
instruction addresses, i.e., those addresses produced by 
a static disassembly. Then $A - P$ is the set of addresses 
that are not correctly identified as instruction addresses 
by the disassembler. We define the confusion factor CF 
to be the fraction of instruction addresses that the disassembler 
fails to identify correctly: 


$CF = |A - P| / |A|$. 


Confusion factors for functions and basic blocks are cal- 
culated analogously: a basic block or function is counted 
as being “incorrectly disassembled” if any of the instruc- 
tions in it is incorrectly disassembled. The reason for 
computing confusion factors for basic blocks and func- 
tions as well as for instructions is to determine whether 
the errors in disassembling instructions are clustered in a 
small region of the code, or whether they are distributed 
over significant portions of the program. 
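
The confusion factor is straightforward to compute once the actual and perceived address sets are known. The sketch below uses hypothetical function names; the second function applies the rule that a basic block or function counts as incorrectly disassembled if any of its instructions is missed.

```python
def confusion_factor(actual, perceived):
    """CF = |A - P| / |A| for sets of instruction addresses."""
    actual = set(actual)
    if not actual:
        raise ValueError("the set of actual instruction addresses is empty")
    return len(actual - set(perceived)) / len(actual)


def unit_confusion_factor(units, perceived):
    """Confusion factor over basic blocks or functions.

    units -- iterable of sets, each holding the instruction addresses of one unit
    """
    units = [set(u) for u in units]
    if not units:
        raise ValueError("no units given")
    perceived = set(perceived)
    wrong = sum(1 for u in units if u - perceived)  # any missed instruction counts
    return wrong / len(units)


if __name__ == "__main__":
    A = {0, 3, 4}   # actual instruction addresses
    P = {0, 2}      # addresses reported by a disassembler
    print(confusion_factor(A, P))
    print(unit_confusion_factor([{0}, {3, 4}], P))
```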


4.2 Evaluating Control Flow Errors 


Two kinds of errors can occur when comparing the control 
flow structure of the disassembled program $P_{disasm}$ 
with that of the original program $P_{orig}$. First, $P_{disasm}$ may 
contain some edge that does not appear in $P_{orig}$, i.e., the 
disassembler may mistakenly find a control flow edge 
where the original program did not have one. Second, 
$P_{disasm}$ may not contain some edge that appears in $P_{orig}$, 
i.e., the disassembler may fail to find an edge that was 
present in the original program. We term the first kind of 
error overestimation errors (written $\Delta_{over}$) and the second 
kind underestimation errors (written $\Delta_{under}$), and express 
them relative to the number of edges in the original program. 
Let $E_{orig}$ be the set of control flow edges in the 
original program and $E_{disasm}$ the set of control flow edges 
identified by the disassembler; then: 


$\Delta_{over} = |E_{disasm} - E_{orig}| / |E_{orig}|$ 
$\Delta_{under} = |E_{orig} - E_{disasm}| / |E_{orig}|$ 
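
The two control flow error measures can be computed directly from edge sets. In the sketch below, edges are (source, destination) address pairs and the function name control_flow_errors is an assumption for illustration.

```python
def control_flow_errors(e_orig, e_disasm):
    """Return (overestimation, underestimation), both relative to |E_orig|.

    e_orig   -- set of (source, destination) edges in the original program
    e_disasm -- set of edges reported by the disassembler
    """
    e_orig, e_disasm = set(e_orig), set(e_disasm)
    if not e_orig:
        raise ValueError("the original program has no control flow edges")
    over = len(e_disasm - e_orig) / len(e_orig)    # spurious edges
    under = len(e_orig - e_disasm) / len(e_orig)   # missed edges
    return over, under


if __name__ == "__main__":
    original = {(0, 3), (3, 4), (4, 8)}
    reported = {(0, 3), (3, 4), (2, 9), (5, 7)}
    print(control_flow_errors(original, reported))
```

Note that the overestimation measure can exceed 1, since the number of spurious edges is not limited by the number of original edges.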


² We also considered taking into account the set $P - A$ of addresses 
that are erroneously identified as instruction addresses by the disassembler, 
but we rejected this approach because it “double counts” the 
effects of disassembly errors. 
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Even if we assume a perfect “attack disassembler” that 
does not incur any disassembly errors, its output will 
nevertheless contain control flow errors arising from two 
sources. First, it will fail to identify control transfers that 
have been transformed to trap-raising instructions. Sec- 
ond, it will erroneously identify bogus control transfers 
introduced by the obfuscator. We can use this to bound 
the control flow errors even for a perfect disassembly. 
Suppose that nap control flow edges are lost from a pro- 
gram due to control transfer instructions being converted 
to traps, and nyoeus bogus control flow edges are added 
by the obfuscator. Then, a lower bound on underestima- 
tion errors, min Aynder, 18 obtained when the only con- 
trol transfers that the attack disassembler fails to find are 
those that were lost due to conversion to trap instructions: 
min Aunder = Nap | Eorig. An upper bound on overesti- 
mation errors, max Aoyer, is obtained when every bogus 
conditional branch inserted by the obfuscator is reported 
by the disassembler: max Agver = Nbogus i Eorig- 
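
The error measures and bounds defined in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an edge is represented here as a (source address, target address) pair, and the function names are hypothetical rather than part of our tools.

```python
from typing import Set, Tuple

# Assumption: a control flow edge is a (source address, target address) pair.
Edge = Tuple[int, int]


def control_flow_errors(e_orig: Set[Edge], e_disasm: Set[Edge]) -> Tuple[float, float]:
    """Return (delta_over, delta_under), both relative to |E_orig|."""
    if not e_orig:
        raise ValueError("original program has no control flow edges")
    # Edges reported by the disassembler that the original program lacks.
    delta_over = len(e_disasm - e_orig) / len(e_orig)
    # Edges of the original program that the disassembler missed.
    delta_under = len(e_orig - e_disasm) / len(e_orig)
    return delta_over, delta_under


def perfect_disassembly_bounds(n_trap: int, n_bogus: int, n_orig: int) -> Tuple[float, float]:
    """Return (max delta_over, min delta_under) for an error-free disassembler."""
    if n_orig <= 0:
        raise ValueError("original program has no control flow edges")
    return n_bogus / n_orig, n_trap / n_orig


if __name__ == "__main__":
    e_orig = {(0, 4), (0, 8), (4, 12), (8, 12)}
    e_disasm = {(0, 4), (4, 12), (4, 16)}
    print(control_flow_errors(e_orig, e_disasm))  # (0.25, 0.5)
```

Both measures use the number of original edges as the denominator, so the overestimation error can exceed 100% when many spurious edges are reported.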


5 Experimental Results 


We evaluated the efficacy of our techniques using eleven
programs from the SPECint-2000 benchmark suite.³ Our
experiments were run on an otherwise unloaded 2.4 GHz 
Pentium IV system with 1 GB of main memory running 
RedHat Linux (Fedora Core 3). The programs were com- 
piled with gcc version 3.4.4 at optimization level 

The programs were profiled using the SPEC training in- 
puts and these profiles were used to identify any hot spots 
during our transformations. The final performance of 
the transformed programs was then evaluated using the 
SPEC reference inputs. Each execution time reported 
was derived by running seven trials, removing the high- 
est and lowest times from the sampling, and averaging 
the remaining five. 
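
The timing procedure described above amounts to a trimmed mean. A minimal sketch follows; the function name and the sample times are hypothetical and are not measurements from our experiments.

```python
from typing import Sequence


def trimmed_mean(times: Sequence[float]) -> float:
    """Drop the single highest and lowest time and average the rest."""
    if len(times) < 3:
        raise ValueError("need at least three trials")
    kept = sorted(times)[1:-1]  # discard the minimum and the maximum
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Seven made-up trial times in seconds; five remain after trimming.
    trials = [10.0, 12.0, 11.0, 50.0, 11.0, 9.0, 11.0]
    print(trimmed_mean(trials))  # 11.0
```

Discarding the extremes keeps a single disturbed run (the 50.0 above) from dominating the reported time.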


We experimented with three different “attack disassemblers”
to evaluate our techniques: GNU objdump [24]; IDA Pro [11],
a commercially available disassembly tool that is generally
regarded to be among the best disassemblers available;⁴ and
an exhaustive disassembler by Kruegel et al. that was
engineered to handle obfuscated binaries [15]. objdump uses
a straightforward linear sweep algorithm, while IDA Pro uses
recursive traversal. The exhaustive disassembler of Kruegel
et al. takes into account the possibility that the input bi- 
nary may be obfuscated by not making any assumptions 
about instruction boundaries. Instead, it considers alter- 
native disassemblies starting at every byte in the code 
region of the program, then examines these alternatives 


³ We did not use the eon program from this benchmark suite
because we were not able to build it.
⁴ We used IDA Pro version 4.3 for the results reported here.


using a variety of statistical and heuristic analyses to dis- 
card those that are unlikely or impossible. Kruegel et al. 
report that this approach yields significantly better dis- 
assemblies on obfuscated inputs than other existing dis- 
assemblers [15]; to our knowledge, the exhaustive disas- 
sembler is the most sophisticated disassembler currently 
available. 


In order to maintain a reasonable balance between 
the extent of obfuscation and the concomitant runtime 
overhead, we obfuscated only the “cold code” in the 
program—where a basic block is considered “cold” if, 
according to the execution profiles used, it is not exe- 
cuted. We evaluated a number of different combinations 
of obfuscations. The data presented below correspond to 
the combination that gave the highest confusion factors 
without excessive performance overhead: flip branches 
to increase the number of unconditional jumps in the 
code (see Section 3.2); convert all unconditional control 
transfers (jumps, calls, and function returns) in cold code
to traps; insert bogus code after traps; and insert junk 
bytes after jmp, ret, and halt instructions. 


Disassembly Error 


The extent of disassembly error, as measured by confusion
factors (Section 4.1), is shown in Figure 4(a). The results
differ depending on the attack disassembler, but the results
for each disassembler are remarkably consistent across the
benchmark programs. Because we have focused primarily on
disguising control transfer instructions by transforming
them into signal-raising instructions, it does not come as a
surprise that the straightforward linear sweep algorithm
used by the objdump disassembler has the least confusion, at
43% of the instructions on average. However, these errors
are spread across 68% of the basic blocks and 90% of the
functions. The other disassemblers are confused to a much
greater extent (55% of the instructions for the exhaustive
disassembler and 57% for IDA Pro, on average), but their
errors are somewhat more clustered: they cover only about
60% of the basic blocks and slightly fewer functions (89%
and 85%, respectively).


Overall, the instruction confusion factors show that 
a significant portion of each binary is disassembled in- 
correctly; the basic block and function confusion factors 
show that the errors in disassembly are distributed over 
most of the program. Taken together, these data show 
that our techniques are effective even against state-of- 
the-art disassembly tools. 


We have also measured the relative confusion factors for
hot and cold instructions, i.e., those in hot versus cold
basic blocks. For objdump, the confusion factors are nearly
identical, at 42% of the hot instructions and 44% of the
cold instructions (again on average). The exhaustive
disassembler was confused by fewer of the hot instructions
(35%) but more of the cold instructions (59%). IDA Pro did
the best on hot instructions at 28% confusion, on average,
but worst on cold instructions at 62% confusion. It is not
surprising that the exhaustive disassembler of Kruegel et
al. and IDA Pro did better with hot code, because we did not
obfuscate it except to insert junk after hot unconditional
jumps, and junk by itself should not confuse an exhaustive
or recursive descent disassembler. Then again, these
disassemblers still failed to disassemble about a third of
the hot code.


As an aside, we had thought that interleaving hot and cold
basic blocks would make disassembly errors caused by
obfuscations in cold code “spill over” into succeeding hot
code and increase the confusion there. This turns out to be
the case for objdump, which is especially confused by junk
byte insertion. However, IDA Pro and the exhaustive
disassembler are still able to find most hot code blocks. In
fact, such interleaving introduces additional unconditional
jumps in the code, e.g., from one hot block to the next one,
jumping around the intervening cold code. The exhaustive
disassembler and IDA Pro are able to find these jumps and
use them to improve disassembly, resulting in less confusion
when hot and cold code are interleaved. Moreover, programs
run more slowly when hot and cold blocks are interleaved due
to poorer cache utilization.


Control Flow Obfuscation 


Figure 4(b) shows the effect of our transformations in ob- 
fuscating the control flow graph of the program. The sec- 
ond column gives, for each program, the actual number 
of control flow edges in the original program. These are 
counted as follows: each conditional branch gives rise to 
two control flow edges; each unconditional branch (di- 
rect or indirect) gives rise to a single edge; and each func- 
tion call gives two control flow edges—one correspond- 
ing to a “call edge” to the callee’s entry point, the other 
to a “return edge” from the callee back to the caller. Col- 
umn 3 gives the number of control flow edges removed 
due to the conversion of control flow instructions to traps, 
while column 4 gives the number of bogus control flow 
edges added by the obfuscator. Columns 5 and 6 give, 
respectively, an upper bound on the overestimation error 
and a lower bound on the underestimation error. The re- 
maining columns give, for each attack disassembler, the 
extent to which it incurs errors in constructing the control 
flow graph of the program, as discussed in Section 4.2. 
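
The edge-counting convention used for the second column can be written down directly. The instruction-kind labels below are hypothetical names chosen for this sketch; only the per-kind edge counts come from the text.

```python
from collections import Counter
from typing import Iterable

# Edges contributed by each kind of control transfer, as described above.
EDGES_PER_KIND = {
    "cond_branch": 2,    # taken and fall-through edges
    "uncond_branch": 1,  # direct or indirect jump
    "call": 2,           # call edge plus return edge
}


def count_control_flow_edges(kinds: Iterable[str]) -> int:
    """Count control flow edges; other instruction kinds contribute none."""
    counts = Counter(kinds)
    return sum(EDGES_PER_KIND.get(kind, 0) * n for kind, n in counts.items())


if __name__ == "__main__":
    program = ["cond_branch", "add", "call", "uncond_branch", "mov"]
    print(count_control_flow_edges(program))  # 5
```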


It can be seen from Figure 4(b) that none of the three
attack disassemblers tested fares very well at constructing
the control flow graph of the program. objdump fails to find
over 63% of the control flow edges in the


Figure 5: Effect of Obfuscation on Execution Speed


original program; at the same time, it reports over 71% 
spurious edges (relative to the number of original edges 
in the program) that are not actually present in the pro- 
gram. The exhaustive disassembler fails to find over 60% 
of the edges in the original program, and reports over 
27% spurious edges. IDA Pro fails to find over 63% of 
the control flow edges in the original program and reports 
over 41% spurious edges. Again the results for each dis- 
assembler are very consistent across the benchmark pro- 
grams. 


Also significant are the error bounds reported in 
columns 5 and 6 of Figure 4(b). These numbers indi- 
cate that, even if we suppose perfect disassembly, the re- 
sult would incur up to 85.5% overestimation error and at 
least 28.93% underestimation error. 


Execution Speed 


Figure 5 shows the effect of obfuscation on execution 
speed. For some programs—such as gap, gzip, mcf, 
parser, twolf, vortex, and vpr—the execution characteris- 
tics on profiling input(s) closely match those on the refer- 
ence input, so there is essentially no slowdown. (In fact, 
gzip ran faster after obfuscation; we believe this is due to 
a combination of cache effects and experimental errors 
resulting from clock granularity.) For other programs— 
such as crafty, gcc and perlbmk—the profiling inputs are 
not as good predictors of the runtime characteristics of 
the program on the reference inputs, and this results in 
significant slowdowns: a factor of 1.6 for crafty and gcc 
and 2.1 for perlbmk. The mean slowdown seen for all 
eleven benchmarks is 21%. 


Figure 4: Efficacy of obfuscation
(a) Disassembly Errors (Confusion Factor, %)
(b) Control Flow Errors (%)
Key: $n_{trap}$: control flow edges lost due to trap
conversion; $\max \Delta_{over}$: upper bound on
overestimation errors; $\min \Delta_{under}$: lower bound on
underestimation errors


We also measured the effect on execution speed of
obfuscating a portion of the hot code blocks. Let $\theta$
specify the fraction of the total number of instructions
executed at runtime that should be accounted for by hot
basic blocks. (The execution times in Figure 5 are for
$\theta = 1.0$, i.e., all basic blocks with an execution
count greater than 0 are considered hot.) If we run our
obfuscator with $\theta = 0.999$, which means that, in
addition to cold basic blocks, we obfuscate the hot basic
blocks that account for just a tenth of a percent of the
dynamic execution count, then the mean slowdown for the
eleven benchmarks increases to a factor of 2.38. For smaller
values of $\theta$ the situation is far worse: at
$\theta = 0.99$ the mean slowdown is a factor of 6.79, and
at $\theta = 0.9$ it climbs to a factor of 43.39. The
confusion factor increases somewhat when $\theta$ is
decreased, but even at $\theta = 0.9$ the increase in
confusion is less than 10% relative to the confusion at
$\theta = 1.0$.
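
One plausible way to pick the hot blocks for a given threshold is sketched below. It is an assumption made for illustration, not a description of our obfuscator: blocks are taken greedily from most to least heavily executed, a block's weight is its execution count times its instruction count, and the profile format and function name are hypothetical.

```python
from typing import Dict, Set, Tuple

# block name -> (execution count from the profile, instructions in the block)
Profile = Dict[str, Tuple[int, int]]


def select_hot_blocks(profile: Profile, theta: float) -> Set[str]:
    """Pick the most heavily executed blocks until they cover a fraction
    theta of all dynamically executed instructions."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    weight = {name: count * size for name, (count, size) in profile.items()}
    total = sum(weight.values())
    hot: Set[str] = set()
    covered = 0
    # Visit blocks from heaviest to lightest; never-executed blocks stay cold.
    for name in sorted(weight, key=weight.get, reverse=True):
        if weight[name] == 0 or covered >= theta * total:
            break
        hot.add(name)
        covered += weight[name]
    return hot


if __name__ == "__main__":
    profile = {"A": (1000, 5), "B": (10, 4), "C": (1, 10), "D": (0, 7)}
    print(sorted(select_hot_blocks(profile, 1.0)))    # ['A', 'B', 'C']
    print(sorted(select_hot_blocks(profile, 0.995)))  # ['A', 'B']
```

Every block not returned would be treated as cold and obfuscated, so lowering the threshold moves rarely executed but nonzero-count blocks such as C into the obfuscated set.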


Program Size 


Figure 6 shows the impact of obfuscation on the size of
the text and initialized data sections. It can be seen that
the size of the text section increases by factors ranging
from 1.90 (crafty) to almost 2.1 (vortex), with a mean
increase of a factor of 2.01. The relative growth in the
size of the initialized data section is considerably larger,
ranging from a factor of about 10 (crafty) to a factor of
over 58 (twolf), with a mean growth of a factor of 26.46.
The growth in the size of the initialized data is due to the
addition of the mapping tables used to compute the type
of each branch as well as its target address. However,
this large relative growth in the data section size is due
mainly to the fact that the initial size of this section is
not very large. When we consider the total increase in
memory requirements due to our technique, obtained as
the sum of the text and initialized data sections, we see
that it ranges from a factor of 2.22 (crafty) to about 2.5
(parser and vortex), with a mean growth of a factor of
about 2.4.

Figure 6: Effect of Obfuscation on Text and Data Section Sizes


The increase in the size of the text section arises from 
three sources. The first of these is the code required to 
set up and raise the trap for each obfuscated control trans- 
fer instruction. The second is the junk bytes and bogus 
conditional branch inserted after a trap instruction.
Finally, there is the signal handler and restore code. In
our current implementation, the first two of these sources
(the setup code for a trap and the bogus code inserted after
a trap) introduce on average an additional 30 bytes of
memory for each obfuscated control transfer instruction.
This accounts for over 95% of the total increase in the 
text section size. Each obfuscated control transfer also 
adds three memory words (12 bytes) to the initialized 
data section, accounting for the increase in the size of 
this section. 
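
The per-transfer costs above give a rough size model. The sketch below uses only the two averages stated in the text (30 bytes of text and 12 bytes of data per obfuscated control transfer) and ignores the signal handler and restore code; the function name and the input sizes in the example are made up for illustration.

```python
from typing import Tuple

TEXT_BYTES_PER_TRANSFER = 30  # trap setup plus bogus code, on average
DATA_BYTES_PER_TRANSFER = 12  # three 4-byte words in the mapping tables


def estimate_growth(n_obfuscated: int, text_size: int, data_size: int) -> Tuple[float, float, float]:
    """Return growth factors for (text, initialized data, text + data)."""
    new_text = text_size + TEXT_BYTES_PER_TRANSFER * n_obfuscated
    new_data = data_size + DATA_BYTES_PER_TRANSFER * n_obfuscated
    combined = (new_text + new_data) / (text_size + data_size)
    return new_text / text_size, new_data / data_size, combined


if __name__ == "__main__":
    # Hypothetical binary: 300,000 bytes of text, 5,000 bytes of data,
    # and 10,000 obfuscated control transfers.
    text_factor, data_factor, total_factor = estimate_growth(10_000, 300_000, 5_000)
    print(text_factor, data_factor)  # 2.0 25.0
```

The example shows why the relative growth of the data section is so much larger than that of the text section: the absolute increase is smaller, but so is the section it is added to.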


Stealth 


The stealth of an obfuscation refers to the difficulty of 
distinguishing code introduced by an obfuscator from the 
original program code. We estimate the stealth of our ob- 
fuscations by comparing the distributions of (i) opcodes 
and (ii) opcode pairs from adjacent instructions. The 
points of comparison are the original program, the ob- 
fuscated program (for which we know the actual instruc- 
tions), and the instructions observed by the different at- 
tack disassemblers we tested. The idea is to see whether 
the obfuscation process introduces any atypical instruc- 
tion sequence signatures that could be used to identify 
the obfuscation code statically. 


Figure 7 shows the distribution of individual opcodes.
Because of the large number of opcodes in the x86
architecture, we have grouped the commonly used opcodes
into 27 sets (moves, conditional jumps, and so on) and
also added sets for illegal instructions and all other
instructions. The 29 different sets of opcodes are given
along the x-axis (“jc” denotes conditional jumps), while
the y-axis shows the percentage of instructions in any
program accounted for by that opcode. The y-axis uses a
logarithmic scale in order to magnify the data ranges
for small percentage values. Each data point was obtained
by computing the opcode distribution for each of the
SPECint-2000 benchmarks in our test suite: the distribution
in the original binary, the distribution in the obfuscated
program obtained from the obfuscator, and the distributions
for the disassemblies produced by each of the attack
disassemblers. We then used these data to compute the
arithmetic mean, minimum, and maximum values of instruction
percentages for that opcode across all of our test
programs. For each opcode, the dark line in Figure 7
shows the mean instruction percentage, together with the
range of its values. This line is then followed immediately
to the right by a set of lighter lines that show the values
for the obfuscated programs as well as those in the
disassemblies obtained from the attack disassemblers.
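
The per-opcode statistics plotted in Figure 7 can be computed along the following lines. This is a sketch only: the opcode grouping shown is a small hypothetical subset rather than the 29 sets used in the figure, and the function names are illustrative.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Hypothetical partial grouping of mnemonics into opcode sets.
OPCODE_SET = {"mov": "mov", "jz": "jc", "jnz": "jc", "call": "call", "ret": "ret"}


def set_percentages(opcodes: Iterable[str]) -> Dict[str, float]:
    """Percentage of a program's instructions that fall in each opcode set."""
    counts = Counter(OPCODE_SET.get(op, "other") for op in opcodes)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def summarize(per_program: List[Dict[str, float]]) -> Dict[str, Tuple[float, float, float]]:
    """Map each opcode set to (mean, minimum, maximum) across programs."""
    names = set().union(*per_program)
    summary = {}
    for name in names:
        values = [dist.get(name, 0.0) for dist in per_program]
        summary[name] = (mean(values), min(values), max(values))
    return summary


def within_range(summary: Dict[str, Tuple[float, float, float]], name: str, value: float) -> bool:
    """Does a percentage lie inside the observed range for an opcode set?"""
    _, low, high = summary[name]
    return low <= value <= high


if __name__ == "__main__":
    p1 = set_percentages(["mov", "mov", "jz", "ret"])
    p2 = set_percentages(["mov", "call", "jnz", "jnz"])
    print(summarize([p1, p2])["mov"])  # (37.5, 25.0, 50.0)
```

Checking whether the mean for the obfuscated programs falls within the range observed for the original programs is the comparison that the stealth discussion relies on.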


Figure 7 illustrates that, in most cases, the mean value
of each opcode’s range in the obfuscated code is within
the range of values in the unobfuscated benchmark code.
Calls, returns, and jumps are somewhat less frequent for
the obvious reason that we obfuscated many of those
instructions. Conditional jumps are somewhat more frequent
because we added these to bogus code. On balance, however,
there are no obvious outliers that an attacker could use as
a signature for where obfuscations occur.

Figure 7: Obfuscation Stealth I: Distribution of Individual Opcodes


Figure 8 shows the distribution of pairs of adjacent op- 
codes, not including pairs involving illegal opcodes. Due 
to space constraints, we show the data for just one attack 
disassembler, IDA Pro; the data are generally similar for 
the other attack disassemblers. Figure 8(a) shows the ac- 
tual distribution of opcode pairs in the obfuscated code, 
while Figure 8(b) shows the distribution for the disas- 
sembly obtained from IDA Pro. To reduce visual clutter 
in these figures, we plot the ranges of values for each 
opcode pair in the unobfuscated code (the dark band run- 
ning down the graph), but only the mean values for the 
obfuscated code. 


There are two broad conclusions to be drawn from Fig- 
ure 8. First, as can be seen from Figure 8(a), the actual 
distribution of adjacent opcode pairs in the obfuscated 
code is, by and large, reasonably close to that of the orig- 
inal code; however, there are a few opcode pairs, very of- 
ten involving conditional jumps, that occur with dispro- 
portionate frequency. The selection of obfuscation code 
to eliminate such atypical situations is an area of future 
work. The second conclusion is that, as indicated by Fig- 
ure 8(b), the opcode-pairs in the obfuscated code are sig- 
nificantly more random than in the unobfuscated code, 
partly because of disassembly errors caused by “junk 
bytes” inserted by our obfuscator. The outliers in this 
figure might serve as starting points for an attacker, but 
there are dozens of such points, they correspond to thou- 
sands of actual opcode pairs in the program, and there is 
no obvious pattern. 


In summary, the individual opcodes and pairs of adja- 
cent opcodes have approximately similar distributions in 
both unobfuscated and obfuscated programs. Thus, our 
obfuscation method is on balance quite stealthy. 


6 Related Work 


The earliest work on the topic of binary obfuscation that 
we are aware of is by Cohen, who proposes overlapping 
adjacent instructions to fool a disassembler [7]. We are 
not aware of any actual implementations of this proposal, 
and our own experiments with this idea proved to be dis- 
appointing. More recently, we described an approach to 
make binaries harder to disassemble using a combination 
of two techniques: the judicious insertion of “junk bytes” 
to throw off disassembly; and the use of a device called 
“branch functions” to make it harder to identify branch 
targets [20]. These techniques proved effective at thwart- 
ing most disassemblers, including the commercial IDA 
Pro system. Conceptually, this paper can be seen as ex- 
tending this work by disguising control transfer instruc- 
tions and inserting misleading control transfers. More re- 
cently, we described a way to use signals to disguise the 
instruction used to make system calls (‘int $0x80’ in In- 
tel x86 processors), with the goal of preventing injected 
malware code from finding and executing system calls; 
this work required kernel modifications. By contrast, the 
work described in this paper is applicable to arbitrary 
control transfers in programs and does not require any 
changes to the kernel. These two differences lead to sig- 
nificant differences between the two approaches in terms 
of goals, techniques, and effects. 


There has been some recent work by Kapoor [14] and
Kruegel et al. [15] focusing on disassembly techniques
aimed specifically at obfuscated binaries. They work
around the possibility of “junk bytes” inserted in the
instruction stream by producing an exhaustive disassembly
for each function, i.e., where a recursive disassembly is
produced starting at every byte in the code for that
function. This results in a set of alternative
disassemblies, not all of which are viable. The disassembler
then uses a variety of heuristic and statistical reasoning
to rule out alternatives that are unlikely or impossible. To
our knowledge, these exhaustive disassemblers are the most
sophisticated disassemblers currently available. One of
the “attack disassemblers” used for our experiments is
an implementation of Kruegel et al.’s exhaustive
disassembler.

Figure 8: Obfuscation Stealth II: Distribution of Opcode Pairs


There is a considerable body of work on code obfus- 
cation that focuses on making it harder for an attacker 
to decompile a program and extract high level semantic 
information from it [9, 10, 31, 32]. Typically, these
authors rely on the use of computationally difficult static
analysis problems—e.g., involving complex Boolean ex- 
pressions, pointers, or indirect control flow—to make it 
harder to construct a precise control flow graph for a pro- 
gram. Our work is orthogonal to these proposals, and 
complementary to them. We aim to make a program 
harder to disassemble correctly, and to thereby sow un- 
certainty in an attacker’s mind about which portions of a 
disassembled program have been correctly disassembled 
and which parts may contain disassembly errors. If the 
program has already been obfuscated using any of these 
higher-level obfuscation techniques, our techniques add 
an additional layer of protection that makes it even harder 
to decipher the actual structure of the program. 


Even greater security may be obtained by maintaining
the software in encrypted form and decrypting it as
needed during execution, as suggested by Aucsmith [1];
or by using specialized hardware, as discussed by Lie et
al. [19]. Such approaches have the disadvantages of high
performance overhead (in the case of runtime decryption
in the absence of specialized hardware support) or a loss
of flexibility because the software can no longer be run
on stock hardware.


7 Conclusions 


This paper has described a new approach to obfuscating 
executable binary programs and evaluated its effective- 
ness on programs in the SPECint-2000 benchmark suite. 
Our goals are to make it hard for disassemblers (and hu- 
mans) to find the real instructions in a binary and to give 
them a mistaken notion of the actual control flow in the 
program. To accomplish these goals, we replace many 
control transfer instructions by traps that cause signals, 
inject signal handling code that actually effects the orig- 
inal transfers of control, and insert bogus code that fur- 
ther confuses disassemblers. We also use randomization 
to vary the code we insert so it does not stand out. 


These obfuscations confuse even the best disassemblers.
On average, the GNU objdump program [24] misunderstands
over 43% of the original instructions, over-reports the
control flow edges by 71%, and misses 63% of the original
control flow edges. The IDA Pro system [11], which is
considered the best commercial disassembler, fails to
disassemble 57% of the original instructions, over-reports
control flow edges by 41%, and under-reports control flow
edges by 85%. A recent disassembler [15] that has been
designed to deal with obfuscated programs fails to
disassemble over 55% of the instructions, over-reports
control flow edges by 27%, and under-reports control flow
edges by over 60%.


These results indicate that we successfully make it 
hard to disassemble programs, even when we only ob- 
fuscate code that is in cold code blocks. If we obfuscate 
more of the code, we can confuse disassemblers even 
more. However, our obfuscation method slows down 
program execution, so there is a tradeoff between the de- 
gree of obfuscation and execution time. When we ob- 
fuscate only cold code blocks, the average slow-down is 
21%, and this result is skewed by three benchmarks for 
which the training input is not a very good predictor for 
execution on the reference input. On many programs, 
the slowdown is negligible. An interesting possibility,
which we have not explored but could easily add to our
obfuscator, would be to selectively obfuscate some of
the hot code, e.g., that which the creator of the code
especially wants to conceal.

Figure 9: A (non-exhaustive) list of rewriting rules
(operators are as in the C language). The rules are grouped
by the expression being rewritten: the constant 0 (Zero), a
nonzero constant k (Nonzero k), and an arbitrary expression
x (Arbitrary x). Side conditions: k ≠ 0 and a is arbitrary;
m ≠ 0 and n is any value (rotr = rotate right; rotl = rotate
left); n = w − m, where m is the position of the least
significant ‘1’ bit in k and w is the machine word size in
bits; y is any value.


The rewritten expression may contain “free variables,” 
i.e., variables that are not initialized to any value. The 
value of the overall expression does not depend on the 
actual value taken on by such a free variable, so any value 
will do. In our implementation, we simply use the con- 
tents of any arbitrary register or legal memory location 
for such variables. 


Once the rewritten expression has been generated, 
we generate code for it via a straightforward post-order 
traversal of the final syntax tree. 
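
The rewriting and code generation steps can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the three identities used (a ^ a = 0, 0 & a = 0, k | 0 = k) are stand-ins and are not claimed to be the rules of Figure 9, the target is a hypothetical stack machine rather than x86, and the names are invented.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

MASK = 0xFFFFFFFF  # assume a 32-bit machine word


@dataclass
class Const:
    value: int


@dataclass
class Free:
    name: str  # free variable: its run-time value is irrelevant


@dataclass
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Free, BinOp]

OPS = {
    "^": lambda a, b: a ^ b,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
}


def rewrite(expr: Expr, depth: int, rng: random.Random) -> Expr:
    """Expand a constant using illustrative, value-preserving identities."""
    if depth == 0 or not isinstance(expr, Const):
        return expr
    free = Free(f"v{rng.randrange(4)}")
    if expr.value == 0:
        if rng.random() < 0.5:
            return BinOp("^", free, free)  # a ^ a == 0
        return BinOp("&", rewrite(Const(0), depth - 1, rng), free)  # 0 & a == 0
    return BinOp("|", expr, rewrite(Const(0), depth - 1, rng))  # k | 0 == k


def codegen(expr: Expr) -> List[Tuple[str, object]]:
    """Post-order traversal: operands first, then the operator."""
    if isinstance(expr, Const):
        return [("push", expr.value)]
    if isinstance(expr, Free):
        return [("load", expr.name)]
    return codegen(expr.left) + codegen(expr.right) + [("apply", expr.op)]


def run(code: List[Tuple[str, object]], env: Dict[str, int]) -> int:
    """Execute the generated code on a toy stack machine."""
    stack: List[int] = []
    for kind, arg in code:
        if kind == "push":
            stack.append(arg & MASK)
        elif kind == "load":
            stack.append(env[arg] & MASK)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[arg](left, right) & MASK)
    return stack.pop()
```

Running the generated code with different environments for the free variables always yields the original constant, which is the property that allows any register or legal memory location to stand in for a free variable.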
Abstract


We introduce the first active hardware metering scheme
that aims to protect integrated circuit (IC) intellectual
property (IP) against piracy and runtime tampering. The
novel metering method simultaneously employs the inherent
unclonable variability in modern manufacturing technology
and functionality-preserving alterations of the structural
IC specifications. Active metering works by enabling the
designers to lock each IC and to remotely disable it. The
objectives are realized by adding new states and transitions
to the original finite state machine (FSM) to create boosted
finite state machines (BFSM) of the pertinent design. A
unique and unpredictable ID generated by an IC is utilized
to place a BFSM into the power-up state upon activation.
The designer, knowing the transition table, is the only one
who can generate the input sequences required to bring the
BFSM into the functional initial (reset) state. To
facilitate remote disabling of ICs, black hole states are
integrated within the BFSM.

We introduce nine types of potential attacks against
the proposed active metering method. We further describe
a number of countermeasures that must be taken to preserve
the security of active metering against the potential
attacks. We present the implementation details of the
method, whose objectives are to be low-overhead, unclonable,
obfuscated, and stable while having a diverse set of keys.
The active metering method was implemented, synthesized and
mapped on the standard benchmark circuits. Experimental
evaluations illustrate that the method has a low overhead
in terms of power, delay, and area, while it is extremely
resilient against the considered attacks.


1 Introduction 


In the dominant horizontal semiconductor business
model, piracy (illegal copying) and tampering of hardware
are omnipresent. In the horizontal business model,
hardware IP¹ designed by the leading edge designers
is mostly manufactured in untrusted offshore countries
with lower labor and operational cost. This places the
designers in an unusual asymmetric relationship: the
designed IP is transparent to the manufacturers, but the
fabrication process, the quantity, and the circuitry added
to the manufactured integrated circuits (ICs) by the foundry
are clandestine to the designers and IP providers.

The security threat, financial loss and economic impact
of hardware piracy, which has received far less attention
than software piracy, are even more dramatic than those of
software piracy [8, 31]. Software piracy has received more
attention than hardware piracy in part because it requires
only low-cost resources that are available to the general
public. Protection of hardware is also crucially important
because ICs are pervasively used in almost all electronic
devices and the potentially adversarial fabrication house
has full control over the hardware resources being
manufactured. It is estimated that computer hardware,
computer peripherals, and embedded systems are the dominant
pirated IP components [31].

Several other issues make the IC protection problems 
truly challenging: (i) very little is known about the cur- 
rent and potential IC tampering attacks; (ii) numerous 
attacking strategies exist, since tampering can be con- 
ducted at many levels of abstraction of the synthesis pro- 
cess; (iii) the most likely hardware adversaries are fi- 
nancially strong foundries and foreign governments with 
large economic resources and technological expertise; 
(iv) the adversary has full access to the structural
specification of the design and most often also to the
manufacturing test vectors; (v) the internal parts of
manufactured ICs are intrinsically opaque. While it is
possible to tomographically scan an IC, the dense metal
interconnect in 8 or more layers of modern manufacturing
technology greatly reduces the effectiveness of such
expensive inspections.

IC metering is a set of security protocols that enable
the design house to gain post-fabrication control by
passive or active count of the produced ICs, their
properties and use, or by remote runtime disabling.

Our strategic goal is the development, implementa- 
tion, and quantitative evaluation of symmetric mecha- 
nisms and protocols for hardware protection procured 
by untrusted synthesis, manufacturing, and/or testing fa- 
cilities. The term symmetric emphasizes that both the 
designers and the foundry will be protected by the new 
methods. The symmetry is warranted by the unique vari- 
abilities and the key exchange mechanism that is based 
on the agreement of both parties for unlocking each IC. 

Hardware metering is important from both commercial
and military points of view. For example, without
metering, a foundry can produce numerous copies of one
design without paying royalties, or, as another example,
sensitive defense designs may become available to
adversaries. The passive hardware metering schemes
work by giving a unique ID to each chip [17, 20, 21]. The
first ever active hardware metering method, introduced in
this paper, provides not just mechanisms for detection of
illegal copies but, more importantly, ensures that no
manufactured IC can be used without the explicit consent of
the designer.

The proposed methods employ two generic security 
mechanisms: (1) uniqueness of each IC due to manufac- 
turing variability; and (2) structural manipulation of the 
design specification while preserving behavioral spec- 
ification. While the first mechanism has been already 
proposed and used for unique IC identification, the sec- 
ond is novel. Even more novel is the integration of two 
mechanisms, a task that requires a great deal of creativity 
and formation of solutions to a spectrum of challenging 
technology, synthesis and optimization problems, with a 
greater impact than the sum of the powers of the individ- 
ual techniques. 

The integration to the functionality is performed by 
intertwining the unique unclonable IDs for each chip
into the FSM of the design. The integrated control part 
is denoted by BFSM, and is built by adding new states 
and transitions to the original FSM, while preserving the 
original functionality of the circuit. To bring the BFSM 
into the functional initial (reset) state, knowledge of the 
transition table is required. Since the designer is the 
only one who knows this information, no one else can 
generate a key with a finite amount of resources to un- 
lock the IC. Using a combination of BFSM and newly 
added black hole states, remote disabling of the ICs can 
be made possible. We outline several possible attacks 
against the introduced active hardware metering method 
and provide mechanisms that neutralize the impact of 
those attacks. For example, we show how the addition of
black hole states disables random guessing attacks.

The remainder of the paper is organized as follows. After
describing the background, flow, and the state of the art
in the next two sections, we present the active metering
method in Section 4. In Section 5, we show a low-overhead
implementation and obfuscation of active metering.
Section 6 introduces potential attacks and the
countermeasures that need to be taken to be resilient
against them. We present an experimental evaluation
of the prototype implementation on several standard
design benchmarks in Section 7. We outline a number of
potential applications in Section 8 and conclude in
Section 9.


2 Preliminaries 


In this section, we describe the necessary background re- 
quired for understanding the active hardware metering 
approach. The aim is to make the paper self-contained 
for the readers who are not familiar with the hardware de- 
sign and synthesis process. Next, we describe the global 
flow of the active hardware metering approach. 


2.1 Background 


Manufacturing variability (MV). The intense indus- 
trial miniaturization of CMOS devices has been driven 
by the quest for increasing computational speed and de- 
vice density, while lowering cost-per-function, as pre- 
dicted by Moore’s law. CMOS variations result in high 
variability in the delay and the currents of the VLSI cir- 
cuits. The variations might be temporal or spatial. The 
temporal variations may occur across nanoseconds to 
years [24]. Spatial variation is due to lateral and vertical
differences from intended polygon dimensions and
film thicknesses. Spatial variation may be intra-die
or inter-die [27]. Aside from device variations, the
circuit response and its variability are correlated with
circuit topology. We will utilize the spatial variations to
our benefit, while we address the problem of alleviating
temporal variability. Bernstein et al. provide a
classification of device variations (beyond 65nm) [4].


Design descriptions. We consider the case in which 
the sequential design in question represents a fully syn- 
chronous flow and that the description of its functionality 
from an input/output (I/O) perspective is publicly avail- 
able. We assume that the functionality is fully fixed, in 
that the I/O behavior is fully specified. Therefore, we uti- 
lize unique unclonable identification to embed a distinct 
mark in the functionality of each IC, without altering the 
functionality in terms of the normal I/O behavior of the 
circuit. Our technique is applicable to the case where the
piece of IP is available as a structural HDL description or
in the form of a netlist that may or may not be technology
dependent. The description uniquely defines the
sequential circuit's behavior and the state transition graph
(see the next subsection) of the design.

Figure 1: Example of an STG with five states. The inputs
required for state-to-state transitions are shown next to
the edges.

During the design flow, the user will take such a de- 
scription and if required, will map it to a specific technol- 
ogy. Typically, logic level optimizations such as retiming 
are performed at this stage. Most often, the circuit is used 
as a part of a more complex design. 


Finite state machine (FSM). An FSM is a discrete
dynamical structure that translates sequences of input
vectors into sequences of output vectors. An FSM can
represent any regular sequential function. It appears in
different forms, e.g., case statements in VHDL and
Verilog HDL. The FSM is defined by the tuple
$M = (\Sigma, \Delta, Q, q_0, \delta, \lambda)$, where
$\Sigma \neq \emptyset$ and $\Delta \neq \emptyset$ are finite
sets of input and output symbols, respectively;
$Q = \{q_0, q_1, \ldots\} \neq \emptyset$ is a finite set of
states, with $q_0$ the "reset" state; the transition
function $\delta(q, a)$, for a state $q$ and an input $a$, is
a mapping $Q \times \Sigma \to Q$; and the output function
$\lambda(q, a)$ is a mapping $Q \times \Sigma \to \Delta$.

To represent the state transitions and output functions 
of the FSM, we use the state transition graph (STG), with 
nodes corresponding to states and edges defining the in- 
put/output conditions for a state-to-state transition. An 
example STG is shown in Figure 1, where there are five
states $\{q_0, q_1, q_2, q_3, q_4\}$, $q_0$ is the reset state, and
there is a one-bit input controlling the state-to-state transitions.
In the remainder of the paper, we use the terms STG and 
FSM interchangeably to refer to the control part of the 
design. 
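
The tuple definition above can be made concrete with a few lines of Python. The sketch below encodes a small Mealy-style FSM as a transition table and runs an input sequence through it. The five states and the one-bit input follow the description of Figure 1, but the specific edges and output values are hypothetical, since the edge list of the figure is not reproduced in the text.

```python
# Minimal Mealy FSM sketch. DELTA maps (state, input) -> (next_state, output),
# i.e. it combines the transition function and the output function.
# The edges below are illustrative assumptions, not the actual STG of Figure 1.
RESET = "q0"

DELTA = {
    ("q0", 0): ("q1", 0), ("q0", 1): ("q2", 0),
    ("q1", 0): ("q3", 1), ("q1", 1): ("q0", 0),
    ("q2", 0): ("q2", 0), ("q2", 1): ("q4", 1),
    ("q3", 0): ("q4", 0), ("q3", 1): ("q1", 1),
    ("q4", 0): ("q0", 1), ("q4", 1): ("q3", 0),
}


def run(inputs, state=RESET):
    """Apply a sequence of input symbols; return (final_state, outputs)."""
    outputs = []
    for a in inputs:
        state, out = DELTA[(state, a)]
        outputs.append(out)
    return state, outputs


final_state, outputs = run([0, 0, 0, 0])
```

In this toy table the input sequence 0, 0, 0, 0 walks q0, q1, q3, q4 and returns to the reset state, which is the kind of closed walk an STG makes easy to read off.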


2.2 Global flow


As a motivational example for our problem, consider the
scenario in which a given hardware intellectual property
(IP) that belongs to its legitimate owner (Alice) is made
available to a fabrication house (Bob). Alice pays for and
demands $N_A$ ICs implementing her design. Bob utilizes
the IP description to construct a mask that implements
the design. Bob employs the mask to make $N_A + N_B$
copies of the design, where the $N_B$ illegal copies incur
little additional cost due to the availability of the mask.
Bob may sell the $N_B$ illegal copies and make a large
profit with negligible additional overhead.

Figure 2: The global flow of the active hardware metering
approach.


The novel active metering helps Alice to protect her
design against piracy by manipulating the STG of the
original design, with the objective of creating a locked
state that is, with very high probability, unique for each
of the ICs manufactured from the design. Upon
manufacturing by Bob, each device will be uniquely locked
(i.e., rendered non-functional) unless Bob contacts Alice
to obtain the particular key that unlocks the IC.
The scheme gives Alice full control over the parts
manufactured from her IP and over the devices in operation.


The global flow of the active hardware metering 
method is shown in Figure 2. We now describe the fig- 
ure step by step. Alice takes the high level design de- 
scription and synthesizes it to get the FSM of the design. 
Next, she constructs the BFSM by adding extra states. 
After that, she sends the detailed manufacturable design 
specifications to Bob who makes the mask and manufac- 
tures multiple ICs implementing the design. The manu- 
factured ICs are locked (nonfunctional) at this stage. For 
each IC, Bob reads out the values in its flip flops (FFs) 
and sends the values to Alice. FF values can be read 
nondestructively, and the values are unique for each IC. 
Alice, knowing the BFSM structure, computes a specific 
key that can be used as input to that IC for unlocking 
it. The key is then sent back to Bob who utilizes it to 
activate the IC. 


3 Related work


We survey the related literature that has influenced and 
inspired this work along four main lines of research: 
variability-based ID generation, authentication and secu- 
rity by variability-based IDs, intellectual property pro- 
tection of VLSI designs, and invasive and noninvasive 
hardware attacks. 

A number of authors have proposed and implemented
the idea of adding circuitry that exploits manufacturing
variability to generate a unique random sequence (ID)
for each chip made with the same mask [20, 21, 28]. The
IDs are unclonable, but they are separate from the
functionality and do not provide a measure of trust, as
they are easy to tamper with and remove. Lofstrom et al.
proposed a method for mismatching the devices based on
changing the threshold of the circuits through randomly
placed impurity (dopant) atoms [20]. Maeda et al.
proposed implementing the random IDs on poly-crystalline
silicon thin film transistors [21]. The drawback of the two
described approaches is that they both need specialized
process technology and are easily detectable. Very
recently, Su et al. have proposed a technique to generate
random IDs by using the threshold mismatches of two
cross-coupled NOR gates in positive feedback [28]. We
will exploit their technique for the random ID generation.

A team of researchers has explored the idea of using
variability-induced delays for authentication and security
[9, 19, 29]. They use Physically Unclonable Functions
(PUFs) that map a set of challenges to a set of responses,
based on an intractably complex physical system. PUFs
are unique, since process variations cause significant
delay differences among ICs coming from the same mask.
For each IC, a database of challenge-response sets is
needed. Authentication occurs when the IC correctly
finds the output of one or more challenge inputs.
PUF-based methods solely utilize manufacturing variability
as their security mechanism. In contrast, our proposed
methods introduce a paradigm shift in hardware security
by adding a new strong mechanism: integration into
circuit functionality at the behavioral synthesis level.
Furthermore, even though the active metering methods can
be utilized for authentication, their main target is
addressing the hardware piracy problem.

Koushanfar et al. have introduced the first hardware
metering scheme that gives unique IDs to each IC [16].
The scheme was to make a small part of the design
programmable so that one could upload different control
paths post fabrication. They further described how to
generate numerous different instances of the same
control path with the same hardware [17]. They have also
provided probabilistic proofs for the number of identical
copies and probability of fraud for the proposed metering
schemes [16, 17]. All metering schemes were passive.
Indeed, no active metering scheme has been proposed
to date. The prior work in the trusted IC domain also
includes the introduction of several watermarking schemes
that integrate watermarks into the functionality of the
design at the behavioral synthesis level [11-13, 15, 22, 23,
30, 32]. Watermarking is a fundamentally different
problem when compared to metering. It addresses the
problem of uniquely identifying each IP and not identifying
each IC, so the existence of the same mask does not
affect the watermarking results. Fingerprinting for unique
identification of programmable platforms has been
proposed [18], but the techniques are not applicable to
application-specific designs (ASICs) due to the existence of
a unique mask. Qu and Potkonjak provide a comprehensive
survey of the watermarking, fingerprinting, and other
hardware intellectual property protection methods [23].

Even though many strong cryptographic techniques
are available in hardware and software, their attack
resiliency has been verified only by classical
cryptanalysis methods. A class of attacks that is very
challenging to address consists of physical techniques.
Physical attacks take advantage of implementation-specific
characteristics of cryptographic devices to recover the
secret parameters. Koeune and Standaert provide a
tutorial on physical security and side-channel effects [14].
The physical attacks are divided into invasive and
noninvasive [3]. Invasive attacks depackage the chip to get
direct access to its inside, e.g., probing. Noninvasive
attacks rely on outside measurements, e.g., from the pins
or by X-raying the chip, without physically tampering with it.

There are multiple ways to attack an IC, including
probing, fault injection, timing, power analysis, and
electromagnetic analysis. Invasive attacks are typically more
expensive than noninvasive ones, since they need
individual probing of each IC. Note that, according to the
well-established taxonomy of physical attacks, attacks
by funded organizations (e.g., foundries) are the most
severe ones, since such organizations have both the
funding and the technology resources [1-3].


4 Active hardware metering 


In this section, we present the details of the active
hardware metering approach. Active metering is integrated
into the standard synthesis flow, and is low overhead,
generalizable, and resilient against attacks. By
generalizable, we mean that the lock can be implemented on
structures that are common to all designs. By attack
resiliency, we mean the cryptographic notion of a lock:
that an attacker that does not have infinite computational
power should not be able to unlock the IC without the
knowledge of a key. To be generalizable, the method
proposed here aims at protecting the design by boosting
the design's FSM (creating a BFSM), a structure common to
the widely used class of sequential designs. In this
section, we describe the BFSM construction and introduce
the locking mechanism. Implementation details are
discussed in the next section.


4.1 Method 


Random Unique Block (RUB). Perhaps the most
important component of the proposed security mechanism
is the unclonable unique ID of each IC. The IDs are a
function of the variability present on each chip and are
therefore specific to the chip. The RUB is a small circuit
added to the design, whose function is to generate the
unique ID. It is desirable that the RUB outputs remain
stable over time. Recently, a few paradigms for designing
unique identification circuitry were proposed [20, 21, 28].
The resulting IDs are mostly stable, and we will later show
how to extract a nonvolatile ID from the RUB, even in the
presence of a few unstable bits.

Addition of the BFSM. The key idea underlying the
proposed active metering scheme can be described in a
simple way. Assume that the original design contains $m$
distinct states. Further assume that the states of the STG
are stored in $k$ 1-bit flip-flops (FFs). The FFs represent a
total of $2^k$ states, out of which $m$ states correspond to
the original design and $(2^k - m)$ states are don't cares.
The metering mechanism adds an extra part to the FSM of
the design. The added states are devised such that there
are a number of transitions from the states in the added
STG to the reset state $q_0$ of the original design.

In our scheme, the power-up state of each IC is built to
be a function of the manufacturing variability and thus
will be unique to each instance. Furthermore, we select
$k$ such that $2^k - m \gg m$. This selection ensures that
when the circuit is powered up, its initial state will, with
high probability, be one of the added states in the BFSM.
Assume that the IC is powered up in the added state
$q_{a0}$. During the standard testing phase, the
manufacturer can read the state of the design, e.g., by
scanning and reading the FFs. However, unless the foundry
has knowledge of the STG, finding the sequence of inputs
required for the correct transition from the state $q_{a0}$
to the reset state $q_0$ is a problem of exponential
complexity. Essentially, there will be no way of finding
the sequence other than trying all the possible
combinations.

More formally, assume that the sequence of $l$ primary
inputs denoted as $a_l = \{a_1, a_2, \ldots, a_l\}$ applied to
the state $q_{a0}$ is one correct sequence that starts
from $q_{a0}$ and traverses $l$ states denoted by
$Q_l = \{q_{a1}, q_{a2}, \ldots, q_{a(l-1)}, q_0\}$, i.e.,
$q_0 = \delta(q_{a0}, a_l)$. Assuming that the input
is $b$ bits and there are cycles in the STG, finding the
correct input sequence that would result in $l$ consecutive
correct transitions is a problem with exponential
complexity with respect to $b$ and is thus intractable.

Figure 3: The boosted FSM (BFSM).
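
To get a feel for the size of this search space, the following Python sketch counts candidate input sequences. It is a back-of-the-envelope calculation under the simplifying assumption that a blind attacker must guess an entire sequence of fixed length at once; the function names and the `num_keys` parameter are introduced here for illustration only.

```python
def candidate_sequences(input_bits: int, length: int) -> int:
    """Number of distinct input sequences of the given length
    when each input vector is input_bits wide."""
    return (2 ** input_bits) ** length


def expected_random_trials(input_bits: int, length: int, num_keys: int = 1) -> float:
    """Expected number of uniformly random guesses until a hit, assuming
    num_keys of the candidate sequences unlock the IC (geometric model)."""
    return candidate_sequences(input_bits, length) / num_keys


# Example: 8-bit primary inputs and a key of 16 input vectors.
space = candidate_sequences(8, 16)
```

With 8-bit inputs and a 16-step key, the count is already 2^128, which illustrates why exhaustive guessing is not a practical way to reach the reset state.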


As an example, consider the STG shown in Figure
3(b) that consists of the original STG that has five states
($\{q_0, q_1, q_2, q_3, q_4\}$) augmented with twenty-seven
added states ($\{q_5, q_6, \ldots, q_{31}\}$). Edges are
incorporated into the added states to ensure that there are
paths from each of the added states to the reset state of
the design. The block shown in Figure 3(a) is a RUB.


The output of the RUB defines $k$ random bits that will
be loaded into the FFs of the augmented STG upon
start-up. Now, an uninformed user who does not have the
information about the transition table (e.g., the foundry)
can read out the data about the initial added state
$q_{a0}$, but this information is not sufficient for finding
the sequence of primary input combinations to arrive at
the reset state $q_0$. However, the person who has the
information about the structure of the STG, upon receiving
the correct state, would know exactly how to traverse from
this locked state to $q_0$. In other words, the owner of the
FSM description is the only entity who would have the key
to unlock the IC.
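
For the party that does hold the transition table, deriving a key is a plain graph search. The sketch below is a minimal illustration, assuming the BFSM is available as a dictionary from state to `{input: next_state}`; the helper names `compute_key` and `apply_key` and the toy table are hypothetical and are not taken from the paper.

```python
from collections import deque


def compute_key(transitions, start, reset):
    """Breadth-first search for a shortest input sequence from start to reset.

    transitions: dict mapping state -> {input_symbol: next_state}.
    Returns a list of input symbols, or None if reset is unreachable.
    """
    parent = {start: None}  # state -> (previous_state, input_used)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == reset:
            key = []
            while parent[state] is not None:
                state, symbol = parent[state]
                key.append(symbol)
            return key[::-1]
        for symbol, nxt in transitions.get(state, {}).items():
            if nxt not in parent:
                parent[nxt] = (state, symbol)
                queue.append(nxt)
    return None


def apply_key(transitions, start, key):
    """Replay a key on the table and return the state that is reached."""
    state = start
    for symbol in key:
        state = transitions[state][symbol]
    return state


# Toy BFSM with 2-bit inputs; "qa*" are added states, "q0" is the reset state.
TOY_BFSM = {
    "qa0": {0b01: "qa1", 0b10: "qa2"},
    "qa1": {0b11: "qa3"},
    "qa2": {0b00: "qa0", 0b01: "qa3"},
    "qa3": {0b10: "q0", 0b00: "qa1"},
    "q0": {},
}

key = compute_key(TOY_BFSM, "qa0", "q0")
```

The search is linear in the size of the table, so the asymmetry in the scheme comes entirely from who knows the table, not from the cost of the search itself.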


An interesting application of the proposed BFSM con- 
struction method is in remote disabling. Alice will save 
the RUBs and the keys for all the ICs that she has un- 
locked. Using the chip IDs that are integrated within the 
functionality, she can add mechanisms that enable her to 
monitor the activities of the registered chips remotely, for 
example, if they are connected to the Internet. She can 
further add transitions from the original STG to untra- 
versed states, to lock the IC in case it is needed. Remote 
disabling has many applications. For example, it can
be used for selective remote programming of devices
and for royalty enforcement.







4.2 Ensuring proper operation 


The following issues and observations ensure proper
operation and low overhead of active hardware metering:


(i) Storing the input sequence (key) for traversal to 
the initial state $q_0$. During testing, once Bob scans out
the FF values and sends them to Alice, she provides the
key to Bob. He includes both the original RUB and the
key in the chip, for example, in a nonvolatile memory.
This data is utilized along with the unclonable RUB
circuit for transition to the reset state. Since the power-up
state is unique for each IC, the sequence of inputs (key)
that takes the power-up state to the reset state is also
specific to each IC. One needs to store the key that
performs the traversal from the power-up state on each chip.
There are many ways to accomplish this. For exam- 
ple, the designer could add a small programmable part 
to the design which needs to be coded with the unique 
sequence (key) before each IC is in operation. Coding 
ensures protection of keys against other software attacks. 
As an alternative, the sequence might not be included in 
the memory and just used as a permanent password to 
the IC. 


(ii) Powering up in one of the added states. This
condition can be easily guaranteed by selecting a large
enough $k$. Assuming that all the states have an equal
probability, the probability of starting in one of the added
states is $(2^k - m)/2^k$. For a given $m$, we select $k$ such
that the probability of not being in one of the added states
is smaller than a given probability. For example, for
$m = 100$ and $k = 30$, the probability of starting up
in an original state is less than $10^{-7}$.
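
The selection of k in item (ii) is a one-line calculation. The following Python sketch, which assumes a uniform distribution over power-up states as stated above, computes the probability of powering up in an original state and searches for the smallest number of flip-flops that meets a target; the function names are illustrative.

```python
def prob_power_up_in_original(m: int, k: int) -> float:
    """Probability that a uniformly random k-bit power-up state is one
    of the m original states."""
    return m / 2 ** k


def min_flip_flops(m: int, max_prob: float) -> int:
    """Smallest k that can encode m states and keeps the probability of
    powering up in an original state below max_prob."""
    k = max(1, (m - 1).bit_length())  # smallest k with 2**k >= m
    while prob_power_up_in_original(m, k) >= max_prob:
        k += 1
    return k


p = prob_power_up_in_original(100, 30)  # roughly 9.3e-8
```

For the numbers used in the text, k = 30 is also the smallest value that pushes the probability below the 1e-7 target for m = 100.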


(iii) Diversity of power-up states (unique IDs). $k$
should also be selected so that the probability of two ICs
having the same ID becomes very low. Assume that we
need $d$ distinct ICs, each with a unique ID. Assuming
that the IDs are completely random and independent,
we utilize the birthday paradox to calculate this
probability and to make it low. Consider the probability
$P_{ICID}(k, d)$ that no two ICs out of a group of $d$ will
have matching IDs out of $2^k$ equally possible IDs. Start
with an arbitrary chip's ID. The probability that the
second chip's ID is different is $(2^k - 1)/2^k$. Similarly,
the probability that the third IC's ID is also different from
the first two is $[(2^k - 1)/2^k] \cdot [(2^k - 2)/2^k]$. The
same computation can be extended through the $d$-th ID.
More formally,

$$P_{ICID}(k, d) = \frac{2^k - 1}{2^k} \cdot \frac{2^k - 2}{2^k} \cdots \frac{2^k - (d - 1)}{2^k} = \prod_{i=1}^{d-1} \frac{2^k - i}{2^k}.$$

Thus, knowing $d$, the number of required distinct copies,
and setting a low value for the collision probability
$1 - P_{ICID}$, we would be able to find $k$ that satisfies
the above equation.
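
The birthday-paradox argument of item (iii) can be evaluated numerically. The sketch below computes the probability that all d IDs are pairwise distinct as a product of the per-chip factors described above, and then searches for the smallest k that meets a collision target. It assumes independent, uniformly distributed IDs, as the text does; the function names are illustrative.

```python
import math


def prob_all_ids_distinct(k: int, d: int) -> float:
    """Probability that d independent, uniform k-bit IDs are pairwise distinct."""
    n = 2 ** k
    if d > n:
        return 0.0  # pigeonhole: more chips than possible IDs
    # Sum the logs of the factors (1 - i/n) to avoid underflow for large d.
    return math.exp(sum(math.log1p(-i / n) for i in range(1, d)))


def min_id_bits(d: int, max_collision_prob: float) -> int:
    """Smallest k whose collision probability does not exceed the target.

    Double precision limits meaningful targets to roughly 1e-15 or larger.
    """
    k = 1
    while 1.0 - prob_all_ids_distinct(k, d) > max_collision_prob:
        k += 1
    return k


p_distinct = prob_all_ids_distinct(3, 3)  # (7/8) * (6/8)
```

The direct product costs O(d) per candidate k, which is adequate for a sketch; for very large d, the usual approximation exp(-d(d-1)/2^(k+1)) gives the same order of magnitude at constant cost.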


(iv) Overhead of the added STG. The number of states
increases exponentially with each added bit, and
thus, the scheme has a very low overhead. Note that,
in modern designs, the control path of the design (i.e., 
FSM) is less than 1% of the total area and hence, adding 
a small overhead to the FSM does not significantly affect 
the total area [7, 10]. In the next section, we will describe 
a low-overhead implementation of the proposed method. 


(v) Diversity of keys. There is a need to ensure that the
keys are distinct in all parts of their sequences, or that
different keys share only a very small subsequence.
This is guaranteed by creating multiple paths on the graph
from each of the states to the reset state. We will
elaborate more on this issue in the attack resiliency section.


5 Low overhead implementation and obfuscation


In this section, we discuss the implementation details of 
the RUB and the BFSM that are the required building 
blocks for the active hardware metering approach. We 
start by outlining the desired properties of each block, 
and then we delve into its implementation details. 


5.1 RUB implementation 


A critical aspect of the proposed security and protection 
mechanisms is the generation of random ID bits. There 
are a number of properties that the RUB implementation 
has to satisfy, including: 


• Low overhead. The added parts must not introduce a
significant additional overhead in terms of delay, power
consumption, and area.

• Distribution of IDs and their correlations. To have
the maximal difference between any two ID numbers (the
maximal Hamming distances), the ID bits must be
completely random. Thus, no correlation must be present
among the ID bits on the same die or across various dies.

• Indiscernibility. The IDs must be integrated within the
design, such that they cannot be discerned by studying
the layout of the circuit. For example, the IDs should not
be placed in a memory-like array, where the regularity of
the array and its connections to the FFs could be easily
detected.

• Stability. There is a need to stabilize the IDs over
the lifetime of an IC. This is particularly important since
studies have shown temporal changes in CMOS process
variations due to many environmental and aging effects,
including residual charges, self-heating, negative-bias
temperature instability, and hot electron effects [4].

For implementing the random IDs, we employed the 
recent novel approach proposed by Su et al. [28]. They 
have designed and tested a new CMOS random ID gen- 
eration circuit that relies on digital latch threshold offset 
voltages. Using cross coupling of gates, they report sig- 
nificant improvement in readout speed and power con- 
sumption over the existing designs. 

Each ID bit is generated by cross-coupled NOR gates. 
The latch sides are pulled low initially. At the high to 
low clock transition, the state of each latch is determined 
by the threshold voltage mismatch of the transistors. Es- 
sentially, the approach relies on the positive feedback in- 
herent in the latch configuration to amplify the mismatch. 
This design removes the need for comparators, low offset 
amplifiers, or extra dopants needed in previous random 
ID generation methods [9, 20]. The nominal overhead
of the above proposed approach is two NOR gates per 
bit. The authors have reported 96% stable IDs using this 
method, while using dummy latches to protect the IDs. 

Even though we use the random bit architecture
described above, our layout and implementation of random
bits are very different. To be indiscernible, we do not
place the coupled NOR gates in an array, and instead
synthesize them with the rest of the circuit and camouflage
them within the sea of gates. The metering approach
relies on the invariability of the ID bits of an IC. In
Subsection 6.2, we provide a mechanism that ensures that
occasional errors in ID bits do not affect the hardware
metering approach.
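
The behavior of such mismatch-based ID bits can be mimicked with a very simple behavioral model. The Python sketch below is not a circuit simulation of the design in [28]; it only assumes that each latch resolves according to the sign of a fixed per-bit mismatch plus optional read noise. The Gaussian model, the parameter values, and the function names are all assumptions made for illustration.

```python
import random


def make_chip(k, rng, sigma_vth=0.03):
    """Draw one fixed threshold-voltage mismatch (in volts) per ID bit.
    The mismatch is frozen at 'manufacturing' time and never changes."""
    return [rng.gauss(0.0, sigma_vth) for _ in range(k)]


def read_id(mismatch, rng, sigma_noise=0.0):
    """Each latch resolves to 1 or 0 by the sign of (mismatch + read noise)."""
    return [1 if dv + rng.gauss(0.0, sigma_noise) > 0 else 0 for dv in mismatch]


def hamming(a, b):
    """Number of bit positions in which two IDs differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(1)  # fixed seed so the sketch is repeatable
chip_a = make_chip(64, rng)
chip_b = make_chip(64, rng)
id_a = read_id(chip_a, rng)
id_b = read_id(chip_b, rng)
```

Raising `sigma_noise` relative to `sigma_vth` makes bits with a small mismatch flip between reads, which is the instability that the stabilization mechanism mentioned above has to tolerate.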


5.2 BFSM implementation 


The key design objectives and challenges of the BFSM 
are as follows: 


• Low overhead. The addition of the states to the
original FSM must have a low overhead in terms of area,
power, and delay. This is particularly challenging: as
we have computed in Subsection 4.2, even under the
assumption of RUBs with a uniform distribution of
random bits, the number of added states must be
exponentially high to ensure proper operation.

• Traversal path. There must be a path on the BFSM
from each of the power-up states (except for the black
hole states that we will describe in Section 6.2) to the
reset state.

• States obfuscation. The states must be completely
obfuscated and interchanged to camouflage the added STG
and the original STG. Another level of obfuscation is
disabling the observability of the FFs, so that similar
states on two ICs do not have exactly the same code
scanned out from their FFs.

• Multiplicity of keys. It is highly desirable to construct
the paths on the added STG in such a way that there are
multiple paths from each power-up state to the reset state.
This will ensure that there are multiple keys for traversal.
Now, if the states are obfuscated such that a similar state
on two ICs has different codes, and each of them gets a
different key for traversal to the original STG, the state
similarities will not be apparent, even to a smart observer.

Figure 4: Illustration of steps for building a sparse 3-bit
STG.

To achieve a low overhead, we have systematically de- 
signed STG blocks that are capable of producing an ex- 
ponential number of states with respect to their under- 
lying hardware resources. The blocks are designed such 
that there are multiple paths from each of the added states 
to the reset state and thus, the multiplicity of the keys is 
satisfied. Our first attempt was to synthesize the added 
blocks of STG and the original STG together. However, 
because the synthesis software automatically optimizes 
the interwoven architecture, it most often ended up with 
a combined STG that was much larger than the sum of 
its components. Thus, we decided to first separately syn- 
thesize the original and the added STG before we merge 
them. Next, we employed obfuscation methods that con- 
stantly alter the values of the FFs, even those that are not 
used in state assignment in the current STG. As we will
see in the attack resiliency section, the introduced
obfuscation method has the side benefit that the adversary
cannot reliably identify the same state on two different ICs.

The added STG can be designed to be low overhead; 
there are exponentially many states for each added FF, ig- 
noring the overhead of the STG edges. However, in real 
situations, the transitions (edges) require logic. Thus, the 
added STG is constrained to be sparse to satisfy a low 
overhead. We have built this block in a modular way. We 
describe one of our modules here and then discuss sys- 
tematically interconnecting the modules to have a multi- 
bit added STG that has a low overhead. 

The first module is a 3-bit added STG. In Figure 4, we
show three steps for building this module. We start with
a ring counter as shown in Graph 4(a). Next, we pick
a few states and reconnect them to break the regularity.
A small example is illustrated in Graph 4(b), where one
state is reconnected such that there is still a path
from each state to any other state. Finally, we add a few
transitions (edges) to the STG, like the example shown
in Graph 4(c); here two states (one of them $q_4$) are
reconnected, while edges including $q_4 \to q_1$ and
$q_7 \to q_3$ are added.

The example is just an illustration. Many other
configurations are possible. The various combinations have
different post-synthesis overhead. To ensure a low
overhead, we exhaustively searched the synthesized 3-bit
structure with various sparse edge configurations like
the example above, and selected the configurations with
the lowest overhead as our 3-bit modules. As is apparent
from the structure, many low-overhead configurations
are possible and we do not need to use the same
module multiple times.
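
The construction steps for a sparse module can be sketched as a graph manipulation with a connectivity check. The Python code below builds an 8-state ring counter, adds a few extra edges, and verifies that every state can still reach every other state. It ignores input labels and synthesis cost, both of which drive the actual selection in the text; the two added edges follow the q4 to q1 and q7 to q3 transitions mentioned for Graph 4(c), and the function names are hypothetical.

```python
def ring_counter(n_states):
    """Each state i has a single edge to (i + 1) mod n_states."""
    return {i: {(i + 1) % n_states} for i in range(n_states)}


def add_edges(graph, edges):
    """Return a copy of graph with the extra (src, dst) edges added."""
    new = {u: set(vs) for u, vs in graph.items()}
    for u, v in edges:
        new[u].add(v)
    return new


def reachable(graph, src):
    """Set of states reachable from src by following edges."""
    seen, stack = {src}, [src]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def strongly_connected(graph):
    """True if there is a path from every state to every other state."""
    nodes = set(graph)
    if not nodes:
        return True
    root = next(iter(nodes))
    reverse = {u: set() for u in graph}
    for u, outs in graph.items():
        for v in outs:
            reverse[v].add(u)
    return reachable(graph, root) == nodes and reachable(reverse, root) == nodes


ring = ring_counter(8)                        # 3-bit ring counter
module = add_edges(ring, [(4, 1), (7, 3)])    # sparse extra transitions
edge_count = sum(len(vs) for vs in module.values())
```

A search over candidate modules could enumerate small edge sets like the one above, keep only the strongly connected ones, and then rank the survivors by their post-synthesis overhead.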

After that, we picked the low overhead modules and 
started to add edges to interconnect them, such that the 
connectivity property is satisfied, and the interconnected 
configuration still has low overhead. Furthermore, we 
need multiple interconnecting paths that can produce 
multiple keys. This is again done via a modular random- 
ized edge addition and searching the space of the syn- 
thesized circuits to find the best multi-bit configurations. 
Note that the synthesis program performs state encoding
for the interconnected modules. We have noticed that the
distance of the codes assigned to the states does not have
a correlation with the proximity of the states. Therefore,
even for two RUBs that differ in only one bit, the
power-up states are typically not close to each other on
the added STG.

In our experiment, we have tested our approach on 12, 
15, and 18-bit added STGs. Now, the original STG has 
to be glued to the added part. This is done by an ob- 
fuscation scheme that ensures the states of FFs that are 
associated with the original STG keep pseudorandomly 
changing, even when we traverse the states of the added 
STGs. Thus, for an observer who studies the values of
the interleaved FFs, the activity study would not yield
an informative conclusion that can help separate the
original and the added states. A simple example for this
obfuscation is depicted in Figure 5. In this figure, a small 
original STG with five states is presented. The cloud 
shown below the original STG indicates the added states. 

There are multiple state transitions from the added
states to the original STG. However, we show only one
arrow in the plot to avoid crowding it. In this example,
we use the three don't care states of the design for
obfuscation purposes: they form three new dummy states
$q^*_5$, $q^*_6$, and $q^*_7$, illustrated in grey. The glue
logic attaches the inputs and the states of the added STG
to the dummy STG. Thus, with careful design, one can
alter the bits on the dummy STG by changing the input
and the states of the added STG without touching the
original FSM. If the design does not have sufficient don't
cares, we can add a couple of FFs for the dummy states
and use the same paradigm. The important requirement
for the dummy states is that as a group they should
present both 1 and 0 digits in all FFs. The original STG is
also connected to the dummy STG and can utilize it as a
black hole (described more thoroughly in Subsection 6.2),
if there is a need to halt the IC.

Figure 5: Obfuscation of the original STG.


6 Attack resiliency 


This section first identifies several types of potential at- 
tacks on the active hardware metering approach. Next, 
we outline a number of mechanisms that must be added 
to the basic active metering scheme to ensure its re- 
siliency against the suggested attacks. 

The adversary (Bob) may attempt to perform a set of 
invasive or noninvasive attacks on the proposed active 
metering scheme. Bob may do so by measuring and 
probing one instance, or by statistically studying a col- 
lection of instances. In this section, we first identify and 
describe the attacks. Next, we propose efficient coun- 
termeasures that can be taken to neutralize the effect of 
potential adversarial acts. 

We assume that Bob knows all the concepts of the 
proposed hardware metering scheme, has the complete 
knowledge of the design at all levels of abstraction pro- 
vided to the foundry (e.g., logic synthesis level netlist, 
and physical design GDS-II file, but no behavioral spec- 
ification), can simultaneously observe all signals (data) 
on all interconnects and flip-flops (FFs), and can mea- 
sure, with no error, all timing characteristics of all gates 
in the ICs. 


6.1 Description of attacks 


The starting point for development and evaluation of the 
metering schemes is identification and specification of 
several types of potential attacks: 


(i) Brute-force attack. Bob aims to place the pertinent 
IC into the initial state by systematically applying the 
input sequences to the BFSM. The systematic application
may be a randomized strategy, or may be based on
scanning the FFs. Brute-force attack works by randomly 
changing the inputs in hope of arriving at the reset state. 
Scanning works by reading out the FF values for a few 
ICs and storing them. The FFs in the current IC are then 
monitored for the existence of a common state with the 
stored ones. In case a state that was read in the previous 
ICs is reached, Bob uses the same key for traversal to the 
reset state. 


(ii) Reverse engineering of FSM. Bob may try to scan 
the FFs to extract the STG. The attempt would be to re- 
move the added STG from the BFSM, to separate the 
original and the added states. 


(iii) Combinational redundancy removal. Bob may 
use the combinational redundancy removal, a procedure 
that attempts to remove the combinational logic that is 
not necessary for the correct behavior of the circuit. The 
proposed techniques of this class often take into account 
the set of reachable states of the FSM under examination 
[25]. Note that the attacks that were described so far can
greatly benefit from the ability to simultaneously moni- 
tor the multitude of signals/values on the IC using laser 
reading. 


(iv) RUB emulation. The goal of this attack is to create a 
reconfigurable implementation capable of realizing hard- 
ware that has the identical functional and timing charac- 
teristics to a RUB for which a legal key is already re- 
ceived. 


(v) Initial power-up state capturing and replaying 
(CAR). Bob knows the initial power-up state of an un- 
locked IC. He can use invasive methods to load the FFs 
of other ICs to the same power-up state as the unlocked 
IC and then utilize the same key to decode the new locks. 
Note that, unless invasive methods are used, the only way 
for Bob to alter the values in the FFs is to change the 
states using the input pins. Without the knowledge of the 
STG, the change of state can only be done as described 
in the first attack. This attack and the next two belong to 
the class of replay attacks. 


(vi) Initial reset state CAR. Bob scans the FFs of an unlocked
IC and reads the code of the reset state. Next, he
employs invasive methods to load the FFs of other ICs to 
unlock them. 


(vii) Control signals CAR. In this attack, Bob attempts 
to bypass the FSM by learning the control signals and 
attempting to emulate them. Bob may completely bypass
the FSM by creating a new FSM that provides control signals
to all functional units and control logic (e.g., MUXes and
FFs) in the datapath.
Figure 6: Example of a black hole FSM.


(viii) Creation of identical ICs using selective IC re- 
lease. Bob only releases the ICs with similar character- 
istics to Alice in the hope of finding the keys by correla- 
tions. This attack is probably the most expensive because 
it involves only a small percentage of manufactured ICs 
by the untrusted foundry. Only the ICs that have simi- 
lar RUBs are reported. Hence, if the attack is successful, 
the design house supplies many keys for ICs with sim- 
ilar RUBs; the birthday paradox shows that one of the 
keys with relatively high likelihood can be used on the 
unreported ICs. Note that, the way for Bob to determine 
closeness of characteristics is by looking at the distances 
of the initial power-up states. 


(ix) Differential FF activity measurement. Bob may 
start to investigate the differential activities of the FFs of 
the unlocked designs for the same input, and then try to 
eliminate the FFs that have different values. 


6.2 Countermeasures 


We propose a number of mechanisms to augment the 
basic active metering scheme and preserve its security 
against the above attacks. Two important observations 
are that FSMs in modern industrial design are always a 
very small part of the overall design, well below 1%, and
that STG recovery is a computationally intractable prob- 
lem [7, 10, 22]: 


• Creating black hole FSMs. Alice may create a black
hole FSM inside the BFSM from which exit is impossible.
Black holes are states that cannot be exited
regardless of the used input sequence. Their design is 
very simple as shown in Figure 6, where the black states 
do not have a route back to the other states. Furthermore, 
a designer can plan the black hole states to be perma- 
nent if it is desirable: a small part may be added, so that 
restarting the IC would not take it out of the black hole 
states. This measure essentially eliminates the effective- 
ness of the first two attacks, because no random input 
sequence leads to the initial state of the functional FSM: 
once the black hole sub-FSM is entered, there is no way
out. A special case is the creation of trapdoor black (gray)
hole FSMs, which are designed in such a way that only a
long, specific sequence of input signals known just to the
designer can bring control out of this FSM and into the 
initial functional state of the overall FSM. An issue that 
needs to be carefully addressed here is preventing the IC 
from powering-up in one of the black-hole states. This 
can be easily ensured by adding extra logic to the black 
hole parts that would disconnect the black hole states 
from the power-up states. 
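
The defining property of a black hole, that no input sequence leads back to the functional reset state, amounts to a reachability check on the STG. Below is a minimal sketch, assuming the STG is given as a dictionary that maps each state to its next state for every input symbol. The toy graph and its state names are hypothetical and only mirror the idea of Figure 6.

```python
from collections import deque


def can_reach(stg: dict[str, dict[str, str]], start: str, target: str) -> bool:
    """Breadth-first search over all inputs: is `target` reachable from `start`?"""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == target:
            return True
        for nxt in stg[state].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical toy BFSM: two added states, a two-state black hole, and the reset state.
stg = {
    "a0": {"0": "a1", "1": "bh0"},
    "a1": {"0": "reset", "1": "a0"},
    "bh0": {"0": "bh1", "1": "bh0"},
    "bh1": {"0": "bh0", "1": "bh1"},
    "reset": {"0": "reset", "1": "reset"},
}

print(can_reach(stg, "a0", "reset"))   # an added state can still be unlocked
print(can_reach(stg, "bh0", "reset"))  # a black hole state cannot
```

A designer could run the same check from every power-up state to confirm that none of them lies inside a black hole.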


• Merging the functional BFSM with the test and
other FSMs (e.g., ones that can be used for debugging
and authentication). In a typical design, the functional
control circuits are not the only FSMs around. Alice,
with the objective of making identification of her functional
FSM more difficult, can further complicate the BFSM
by co-synthesizing it with the others. This augmentation
makes the first two attacks and the three CAR attacks less
effective. In particular, this merger would hinder the
ability to simultaneously monitor the multitude of
signals/values on the IC using laser reading.


• Similar FF activity for the unlocked ICs. The designs
would be made such that once an IC exits the
locked states and is in its functional states, all its FFs 
have a deterministic behavior that is the same for all ICs. 
Thus, the differential FF activity screening would not 
yield any useful information. 


• Creation of specialized functional FSMs (SFFSMs).
Alice can make the security much tighter by integrating 
the RUBs not just to assign the initial power-up state, but 
to alter the structure of the BFSM and make it a SFFSM. 
Using this method, the reset state for FSM of each IC is 
a function of its RUB. Each SFFSM operates correctly 
only if it received a specific stream of signals from the 
RUBs. Since there are exponentially many states with 
respect to the number of FFs in FSM, we map a set of 
blocks that share an identical subset of RUB outputs into 
a single SFFSM. This countermeasure makes the first 
two attacks (i.e., brute-force attack and FSM reverse
engineering) much more difficult and the first two CAR
attacks (i.e., initial power-up state CAR and initial reset
state CAR) almost impossible.


A simple example of this method is presented in Fig- 
ure 7. On this figure, the added STG is shown by the 
cloud on left, and the original STG is plotted in the right 
cloud. The original STG has only 3 states: a reset state
and two other states. Here, the original STG is replicated
twice: one replica is denoted by SFFSM’ and the
other by SFFSM”. The scheme adds logic
to the added STG so that, based on the bits in the RUB, each IC
is categorized into one of three classes. Each of the classes will
Figure 7: A simplified SFFSM.


transition to one of the reset states in one of the clouds: 
original FSM, SFFSM’ or SFFSM”. This scheme will 
cause confusion in FFs scanning methods that aim at 
loading the reset state of an unlocked IC in the FFs of 
a locked chip. Note that the replicated states need not
all be unique, and may be shared among the replicas to
reduce the overhead. 
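
The classification step in this example can be pictured as a small selection function. The sketch below is purely illustrative: the choice of RUB bit positions, the modulo mapping, and the replica labels are all assumptions made for the example, not the logic the scheme actually synthesizes.

```python
def select_replica(rub_bits: str, positions: tuple[int, ...], replicas: list[str]) -> str:
    """Pick the replica whose reset state this IC's added STG leads to.

    rub_bits  -- the RUB output as a bit string (hypothetical representation)
    positions -- which RUB bits feed the classification logic (assumed)
    replicas  -- labels of the original FSM and its replicas
    """
    selected = "".join(rub_bits[p] for p in positions)
    # Assumed mapping: interpret the selected bits as an integer and
    # fold it onto the available replicas.
    return replicas[int(selected, 2) % len(replicas)]


replicas = ["original FSM", "SFFSM'", "SFFSM''"]
for rub in ("1011", "0111", "1111"):
    print(rub, "->", select_replica(rub, (0, 1), replicas))
```

Because two ICs with different RUB bits may land in different replicas, a reset-state code captured from one unlocked IC need not be the right one for another.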

The example is very small, but one can add the RUB- 
dependent states at various stages to ensure that the at- 
tacker is not able to break the system. A combination 
of the SFFSM method and the state obfuscation and en- 
coding would ensure a full security of the design against 
the CAR attacks. Furthermore, using similar methods, 
the RUBs can be also added to the obfuscation scheme 
based on the dummy variables like the example presented 
in Figure 5, so that the same inputs would have different 
random obfuscation patterns. 

Another use of SFFSM is for addressing the effect of 
temporal changes in RUB. Recall that the actual appli- 
cation of the new hardware metering scheme to indus- 
trial designs requires mechanisms that ensure resiliency 
against time-dependent permanent changes of transistors 
as well as gate-level and transient changes due to the en- 
vironmental conditions such as temperature and supply 
voltage fluctuations [4]. 

The exact reconstruction of the first power-up state of 
IC (the particular one for which the designer released the 
key) for the purpose of defeating the variabilities is triv- 
ial: Bob can just load the captured and saved outputs of 
the first power-up RUB for which he has obtained the 
key. This mechanism makes the design susceptible to 
reuse attacks, where Bob can reuse the key and the ini- 
tial RUB for an unlocked IC to decipher another locked 
IC. However, if Alice included SFFSM in her design, she 
would be resilient against this attack. The only technical 
issue that remains to be addressed is to ensure that the 
SFFSM receives the correct data from the physical RUB, 
exactly the same as the one that was first received and for 
which the key is available. Otherwise, the stored key will
fail. 


In presence of temporal variations, ensuring that each 
SFFSM receives the correct data from RUB requires 
error-correction mechanisms. One solution is to employ
standard error-correction codes (ECCs). An alternative
hardware solution that incurs a lower overhead
compared to ECC is to create the specifications of each 
SFFSM in such a way that it transitions into the correct 
next states, even when one or up to a specified number of 
the inputs from the RUB are altered by the environmental 
conditions. Using the Hamming distances of the RUBs,
we can group them into similar SFFSMs and synthesize 
the results such that the error correction mechanisms are 
inherently present. This mechanism is particularly effec- 
tive for longer RUBs that are required for present indus- 
trial designs. Note that, because the minterms for the 
combinational logic that implements transitions are now 
not smaller than for non-resilient versions of the SFFSM, 
the hardware overhead is often zero or negative at the 
expense of the lower resiliency against brute force at- 
tacks [5]. However, since the probability of brute force 
attack can be made arbitrarily small with very low over- 
head (i.e., by using the black holes), this is a favorable 
trade-off. 
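
The tolerance to a bounded number of altered RUB inputs can be illustrated with a Hamming-distance test. This is a minimal software sketch, assuming RUB readings are bit strings and that `max_flips` is the specified number of tolerated bit changes. In the scheme itself the tolerance is built into the synthesized transition logic rather than computed this way.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("readings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def matches_enrolled(reading: str, enrolled: str, max_flips: int) -> bool:
    """Accept a reading that differs from the enrolled RUB in at most max_flips bits."""
    return hamming(reading, enrolled) <= max_flips


enrolled = "10011"  # hypothetical RUB value for which the key was issued
noisy = "10110"     # the same RUB read under different conditions (assumed)
print(hamming(noisy, enrolled))
print(matches_enrolled(noisy, enrolled, max_flips=2))
print(matches_enrolled(noisy, enrolled, max_flips=1))
```

A larger `max_flips` makes more readings acceptable, which is the lower resiliency against brute force that the text trades for the reduced overhead.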


• Resiliency against combinational redundancy removal.
To overcome this attack, Alice must ensure that it is
inapplicable to typical large circuits and incapable of
removing the added states. In general, computing the set
of reachable states can only be done for relatively small
circuits, even when implicit enumeration techniques are
used. Thus, the method is only applicable to small circuits.


• Statistical characterization of gates. Alice can go
one step further and attempt to derive the gate-level char- 
acteristics of the manufactured ICs by measuring the in- 
put/output signals and exploiting the controllability and 
observability into the design. Essentially, knowing the 
circuit diagram, she would be able to write a linear system
of equations that can be solved for obtaining the
approximate gate-level delay and power characteristics 
of the gates. She may even go further to use the extracted 
data to find the distribution of variations across the dif- 
ferent chips (e.g., by using methods such as expectation 
maximization (EM)). Now, if the variations do not have
enough fluctuations, then she will get suspicious and can 
halt the unlocking. This computation would ensure that 
the selective IC release would not be successful. 


• Obfuscation of state activities and encodings. The
implementation of the BFSM presented in the previous 
section renders it impossible to tell the difference be- 
tween the original FSM FFs and the added states FFs. 


This is because all of the FFs are changing all the time. 
Therefore, even though two states of BFSM in two ICs 
might be identical, the attacks based on scanning the 
FFs would not notice that, since a subset of the bits 
will be different. In other words, the FFs not used in
the added FSM are randomly changing. Another ob- 
fuscation method that has already been implemented is 
that the states in the added STG are not in order and are 
coded out of sequence by the synthesis tool. Thus, even 
though there might be a direct transition (edge) between 
two states, the methods based on FFs readings would not 
notice the proximity of the two states, since their code
words are distant from each other. 

Note that the attacks that were described earlier, even
the ones that are computationally very expensive, will 
not be able to unlock the ICs, if the countermeasures de- 
scribed above are in place. 


7 Experimental evaluations 


To test the applicability of the method described earlier, 
we implemented the active hardware metering on stan- 
dard benchmark designs. In this section, we present the 
experimental setup, followed by the overhead of imple- 
menting BFSM on the considered benchmarks. After 
that, we show quantitative analysis of the effectiveness 
of the brute force attacks. We further show how the addi- 
tion of black holes can make the scheme resilient against 
this attack with minimal overhead. Note that many
of the attacks described earlier call for structural
countermeasures that are hard to quantify and evaluate.


7.1 Experiment setup 


We used an extended set of sequential benchmarks from the
ISCAS’89 to evaluate the impact of the active hardware 
metering method [6]. Even though the ISCAS’89 bench- 
marks are the latest comprehensive set of the gate-level 
designs, they are dated compared to the complex circuits 
in design, production and use today. Recall that, following
Moore’s law, the size and complexity of circuits
doubles approximately every 18 months. We use
the larger benchmarks from the set, and we project the 
results to more complex circuits. Our projections show 
that the power, area, and delay overheads diminish as we 
increase the size and complexity. Simultaneously, the 
locking complexity and resiliency against the attacks improve
exponentially, due to the multiplicity of states. We
synthesize the benchmarks using the Berkeley SIS tool
[26], which, given an STG or a logic-level description of a
sequential circuit, produces an optimized netlist in the target
technology (cell library) while preserving the sequential 
input-output behavior. We have written a C program that 
modifies the benchmarks by adding the extra states. The 
program calls SIS to obtain the specifications of the synthesized
and mapped original and modified STGs. When
evaluating the overhead results, the important observa- 
tion is that FSMs (i.e., the control circuitry) in modern 
industrial design are always a very small part of the overall
design, well below 1% [7, 10]. Thus, even doubling
the overhead will have a minimal impact on the overall
circuit that is mostly occupied by memory, testing pins, 
and data path circuitry. 


7.2 Overhead of active hardware metering 


Our first set of experiments studies the overhead of the introduced
scheme in terms of area, power, and delay. It
is worth noting here that our ultimate goal is to integrate 
the active hardware metering method in the design flow. 
Thus we have considered testing the approach on manu- 
factured ICs. However, the prohibitive cost of manufac- 
turing a circuit in aggressive technologies (the quote we 
got for fabricating a circuit in 65nm was $500K) limits 
our experiments to synthesizing the benchmarks. Table 1 
presents the results for the area overhead. Because of 
the relatively small size of the circuits, we added STGs 
with 12 FFs and 15 FFs overhead to the original STGs. 
The first column shows the name of the circuit from the 
ISCAS’89 benchmark. The second column shows the 
number of inputs to the circuit. The third column shows 
the number of outputs to the circuits. Both the number 
of inputs and the number of outputs do not change af- 
ter adding the extra states. The fourth column shows the 
number of FFs in the original circuit. The fifth column 
shows the area of the original circuit. Then we show both 
the new area and the percentage overhead after adding 12 
FFs and 15 FFs for the extra states. It can be seen that the 
percentage area overhead is decreasing as the circuit size 
increases. Thus, for larger circuit sizes, the area overhead 
will be even less significant.

Table 2 shows the delay and power overheads. The 
first column contains the benchmark names. The second 
and third columns show the delay and power estimates of 
the original circuits. These are followed by both the de- 
lay and the percentage delay overhead, and the power and 
the percentage power overhead for adding both 12 FFs 
and 15 FFs STGs respectively. The delay overheads are 
universally small. With the exception of s27 that is too 
small to be considered practical, it is interesting to see 
that even other small benchmarks encountered no delay 
overhead after the addition of the new STG. For the small 
benchmarks that are not realistic compared to the current 
complex designs, the power increases significantly. As 
the circuit size increases, the percentage power overhead 
decreases. 

Next, we make a small model of the percentage of area 
and power overhead versus size of the circuit to extrapo- 


late to more complex designs. The size of the added STG 
is fixed to 15 FFs. Figures 8(a) and 8(b) show the over- 
head data vs. size along with the fitted polynomial mod- 
els, for power and area respectively. The plots suggest 
that as the circuit size increases, the percentage of power 
and area overheads both decrease. Note that more
complex designs require adding significantly more
than 15 FFs. Even if adding an STG with 100 FFs would
add six times the overhead of the 15 FFs case in absolute
terms, the overhead would be negligible, while there will
be 2^100 extra states added to the design. Thus, for current
and future circuit technologies, the BFSM would have a 
minimal impact on the performance in terms of power, 
area, and delay (i.e., it will most likely stay less than 1% 
of the overall design). 


7.3 Resiliency against the brute force attack


Most of the attacks described in Section 6 can be countered
by devising intelligent design strategies, as described
in Subsection 6.2. The only attack that we study
quantitatively here is the brute force attack. We model
this attack by randomly guessing the values on the graph 
until arriving at the functional reset state of the original 
FSM. 

We simulated the brute force attack on BFSMs with 
12, 15, and 18 FFs, varying the inputs from 3 to 8. In this 
experiment, we set an upper bound of 1,000,000 guesses; 
if the reset state is not reached after this many trials, 
the original STG is considered unreachable (denoted by 
N/R) and the brute force attack is reported unsuccessful. 
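
The experiment described above can be approximated with a short random-walk simulation. The sketch below is not the simulator used for Table 3. It assumes the STG is a table `stg[state][symbol]` of next states, that the attacker applies uniformly random input symbols, and that `random_stg` (a hypothetical helper) is an acceptable stand-in for a synthesized added STG.

```python
import random


def random_stg(num_states: int, num_symbols: int, rng: random.Random) -> list[list[int]]:
    """Hypothetical stand-in for an added STG: every transition is chosen at random."""
    return [[rng.randrange(num_states) for _ in range(num_symbols)]
            for _ in range(num_states)]


def brute_force_guesses(stg, start, reset, num_symbols, max_guesses=1_000_000, rng=None):
    """Apply random inputs until `reset` is reached.

    Returns the number of guesses used, or None (reported as N/R) when the
    bound on guesses is exhausted.
    """
    rng = rng or random.Random()
    state = start
    for guess in range(1, max_guesses + 1):
        state = stg[state][rng.randrange(num_symbols)]
        if state == reset:
            return guess
    return None


rng = random.Random(0)
stg = random_stg(num_states=2 ** 12, num_symbols=8, rng=rng)
print(brute_force_guesses(stg, start=1, reset=0, num_symbols=8, rng=rng))
```

Averaging the returned value over many runs and start states gives a figure comparable in spirit to the entries of Table 3; a state whose transitions all loop back into a black hole yields None.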

Table 3 shows the average number of guesses needed 
to unlock the BFSM over 10,000 simulation runs. The
first three rows show added STGs with 12, 15, and 18 
FFs respectively. The next two rows show the results for 
STGs with 12 and 15 FFs, after adding 1 and 2 black 
holes respectively. Although the number of inputs does 
not affect the overhead, it impacts the resiliency against 
the brute force attack: the table illustrates that the brute 
force attacks are less successful if we use more than 3 
different inputs. Also, as the size of the added STG in- 
creases, more guesses are necessary to unlock the circuit. 
By adding one black hole to the smaller FSMs, they per- 
form better than the larger FSMs. Adding one or two 
black holes makes the original STG unreachable for the 
brute force attack. It is worth noting here that STGs with 
12 and 15 FFs are really small, as they have a total of 
4,096 and 32,768 states respectively. If the active metering
scheme were to be implemented on current industrial-strength
designs, the added circuit would have at least
100 FFs, which would create 2^100 ≈ 10^30 states. It would
be impossible for a brute force attack to find a key. Furthermore,
the addition of a few black holes will further make
[Table 1. Column groups: Original Details (In | Out | FFs | Area); 12 FFs (Area | %); 15 FFs (Area | %). Rows: s27, s298, s344, s444, s526, s641, s713, s953, s832, s1238, s1423, s9234, s13207, s38417. The numeric entries did not survive extraction.]

Table 1: Area overhead of active metering for various benchmarks. 


Benchmark | Original power | Power, 12 FFs | Power, 15 FFs
s27 | 134.00 | 1418.70 | 1696.70
s298 | 1167.20 | 2468.60 | 2746.60
s344 | 1030.00 | 2325.90 | 2603.90
s444 | 1550.80 | 2815.20 | 3152.30
s526 | 2065.70 | 3334.30 | 3664.70
s641 | 1560.60 | 2832.10 | 3162.40
s713 | 1670.70 | 2935.00 | 3265.40
s953 | 1816.50 | 3084.20 | 3414.60
s832 | 2849.60 | 4114.00 | 4444.40
s1238 | 2709.40 | 4034.00 | 4312.00
s1423 | 4882.70 | 6226.30 | 6504.30
s5378 | 12459.40 | 13515.00 | 14057.50
s9234 | 19385.50 | 20653.30 | 20983.70
s13207 | 37874.00 | 39138.40 | 39402.00
s38417 | 112706.80 | 113869.00 | 114147.00

[Only the power columns survive; the delay and percentage-overhead columns (Delay, Delay %, Power %) are missing.]

Table 2: Delay and power overhead of active metering for various benchmarks. 


the system resilient against the brute force attack. 
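
The state counts quoted in this subsection follow directly from the number of added FFs. As a quick check of the arithmetic (the function name is only illustrative):

```python
import math


def state_count(num_ffs: int) -> int:
    """An STG encoded on num_ffs flip-flops can have at most 2**num_ffs states."""
    return 2 ** num_ffs


for n in (12, 15, 100):
    count = state_count(n)
    print(f"{n} FFs: {count} states, about 10^{math.log10(count):.1f}")
```

For 100 FFs this is roughly 1.27 x 10^30, which is the 2^100 ≈ 10^30 figure used above.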


Table 4 shows area and power overheads for adding a 
black hole with 2 states to added STGs with 12 and 15 
FFs respectively. The overhead of adding a black hole 
does not exceed 5% even for very small benchmarks. For 
larger circuits it is unnoticeable. Note that we often add
more than one black hole to the design, to warrant the 
impossibility of the brute force attacks. 


To evaluate the diversity of keys, we studied the number
of cycles in the added STGs. For a given STG, we form
a new graph STG* that has the same nodes as the STG but
reverses the edges. Note that simultaneously reversing
all the edges does not affect the number of cycles in the
graph. Since each state in the STG has a path to the reset
state, the directed acyclic graph (DAG) rooted at the original
reset state in STG* will have a path to all states. We
find a DAG of STG* by using Dijkstra’s shortest path
algorithm. Next, we add the remaining STG* edges to the DAG,
check whether they form a cycle, and combine the cycles into one
node; we iterate until the cycles are gone.
This approximate method is used to count the number of
cycles. Using this method, we estimate that the STG
with 12 FFs has more than 40 cycles, which enables the user
to build exponentially many keys for traversal from a given
state. The large number of keys can be easily generated
by a combination of cycling and switching between
the cycles of the STG.
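
A simplified version of this counting procedure is sketched below. It is not the procedure described above: it uses a breadth-first tree instead of Dijkstra's algorithm (the two give the same shortest-path tree when every edge has unit weight), and instead of contracting cycles it counts only the non-tree edges of STG* that point back to an ancestor in the tree. Each such edge closes one distinct directed cycle, so the result is a lower bound. The function name and the toy graph are hypothetical.

```python
from collections import deque


def approx_cycle_count(stg: dict[int, list[int]], reset: int) -> int:
    """Lower bound on the number of directed cycles among states that reach `reset`."""
    # Reverse every edge: s -> t in the STG becomes t -> s in STG*.
    rev = {s: [] for s in stg}
    for s, targets in stg.items():
        for t in targets:
            rev[t].append(s)

    # Breadth-first tree of STG* rooted at the reset state.
    parent = {reset: None}
    queue = deque([reset])
    while queue:
        u = queue.popleft()
        for v in rev[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)

    def is_ancestor_or_self(a: int, u) -> bool:
        while u is not None:
            if u == a:
                return True
            u = parent[u]
        return False

    # A non-tree edge u -> v whose head v lies on the tree path to u closes a cycle.
    count = 0
    for u in parent:
        for v in rev[u]:
            if parent.get(v) != u and is_ancestor_or_self(v, u):
                count += 1
    return count


# Toy STG: a self-loop on the reset state 0 and two 2-cycles (1<->2 and 2<->3).
toy = {0: [0], 1: [0, 2], 2: [1, 3], 3: [2]}
print(approx_cycle_count(toy, reset=0))
```

On the toy graph the function reports the three cycles listed in the comment; on a synthesized STG the count would only be an estimate, as in the text.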


8 Potential applications 


Active hardware metering provides strong anti-piracy 
mechanisms for hardware IP cores as well as remote- 
disabling mechanisms for the manufactured parts. Re- 
mote disabling can be accomplished if a malicious activ- 
Figure 8: Percentage of (a) power and (b) area overheads vs. size after adding a 15-FF STG.


Number of inputs
bits | 3 | 4 | 5 | 6 | 7 | 8
12 | [illegible] | [illegible] | 78939 | [illegible] | 77028 | 82490
15 | 500976 | 610373 | 602157 | 357776 | 392681 | 596260
18 | 933680 | 932501 | [illegible] | [illegible] | N/R | N/R
12+bh | 998000 | 999000 | N/R | N/R | N/R | N/R
15+bh | N/R | N/R | N/R | N/R | N/R | N/R
12+2bh | N/R | N/R | N/R | N/R | N/R | N/R


Table 3: Average number of attempts needed for the brute force attack to unlock the added STG. 


ity is detected. For example, a designer can add an extra 
part to the circuit that detects say, the brute force attack 
where too many invalid inputs are being entered. As an- 
other example, the strange activity patterns of the chip 
may be detected from a network. Upon detecting such a 
situation, a built-in disabling function would be invoked 
that transitions the IC into a non-functional state. If this 
state is a black hole, the IC cannot be used. 
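
The first example, a detector for too many invalid inputs, can be modelled as a small counter that latches once a threshold is exceeded. This is a behavioural sketch only: the class name, the threshold, and the notion of an "invalid" input are assumptions for illustration, and a real design would implement the same idea as extra states and logic in the BFSM.

```python
class DisableMonitor:
    """Counts invalid inputs and latches a disable flag once a threshold is exceeded."""

    def __init__(self, max_invalid: int):
        self.max_invalid = max_invalid  # assumed tolerance before disabling
        self.invalid_seen = 0
        self.disabled = False

    def observe(self, input_is_valid: bool) -> bool:
        """Record one input; return True once the IC should sit in the black hole."""
        if not self.disabled and not input_is_valid:
            self.invalid_seen += 1
            if self.invalid_seen > self.max_invalid:
                self.disabled = True  # latched: later valid inputs do not clear it
        return self.disabled


monitor = DisableMonitor(max_invalid=2)
for valid in (False, True, False, False, True):
    print(monitor.observe(valid))
```

The flag never resets, which mirrors a black hole state that cannot be exited once entered.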

Generally speaking, combinations of the two em- 
ployed security mechanisms, variability-based unique- 
ness of each IC, and structural manipulation of FSM 
while preserving the original behavioral specification, 
provide a powerful basis for creating many security and
DRM protocols. A few of the many possibilities are: 
(i) use of a combination of unique functionality and 
RUB for remote authentication and disablement of smart 
cards; (ii) certification that a computation was executed 
on a specified IC in a distributed environment; and (iii) 
creation of techniques to produce software that can only
run on a specific IC, thereby preventing software piracy. 

Furthermore, the introduced method has the potential 
for a broad impact on the IC industry and military use 
of hardware. As an example, new royalty enforcement 
systems can be enabled: design reuse has emerged as a 
dominant strategy, where different IP cores are often sup- 
plied by different vendors. The final integrator pays each 


IP supplier royalties that are proportional to the number 
of manufactured ICs. All that is needed for royalty en- 
forcement is that each supplier uses its own active meter- 
ing scheme inside its IP. 


9 Conclusion 


We propose the first active hardware metering scheme 
that symmetrically protects the IP designer and the 
foundry by providing a key-exchange mechanism. 
The active metering method utilizes the unclonable 
variability-based ID of each silicon circuit (RUB) to 
uniquely lock the IC at the fabrication house. The FSM 
of the design is enhanced to include many added states, 
designed such that the RUB-based state is one of the ran- 
dom states with a very high probability. The state addi- 
tion was done in such a way that it would not affect the 
functionality of the original design. The key to the locked 
IC can only be provided by the designer who knows the 
state transition graph of the design. We have illustrated 
the addition of black hole states to the BFSM which can 
be utilized for remote control and disabling of the ICs. 
Black hole states are also useful in making the protec- 
tion scheme highly resilient against the brute force at- 
tacks. We presented a low overhead implementation for 
the hardware metering scheme, identified a comprehen- 
[Table 4. Column groups: 12 FFs (% Area | % Power); 15 FFs (% Area | % Power). Rows: s27, s298, s344, s444, s526, s641, s713, s953, s832, s1238, s1423, s5378, s9234, s13207, s38417. The numeric entries did not survive extraction.]

Table 4: Percentage of area and power overheads after
adding one black hole.


sive set of possible attacks, and provided mechanisms 
that make the scheme much more resilient against the at- 
tacks. Experimental evaluations of the proposed meter- 
ing method on standard benchmark circuits illustrate the 
low overhead and the applicability of the approach on 
industrial-size designs and its resiliency against different 
attacks. 
Abstract


The emergence of connections between telecommunica- 
tions networks and the Internet creates significant av- 
enues for exploitation. For example, through the use 
of small volumes of targeted traffic, researchers have 
demonstrated a number of attacks capable of denying 
service to users in major metropolitan areas. While such 
investigations have explored the impact of specific vul- 
nerabilities, they neglect to address a larger issue - how 
the architecture of cellular networks makes these systems 
susceptible to denial of service attacks. As we show in 
this paper, these problems have little to do with a mismatch
of available bandwidth. Instead, they are the result
of the pairing of two networks built on fundamentally
opposing design philosophies. We support this
claim by presenting two new attacks on cellular data services.
These attacks are capable of preventing the use
of high-bandwidth cellular data services throughout an 
area the size of Manhattan with less than 200Kbps of malicious
traffic. We then examine the characteristics common
to these and previous attacks as a means of explaining
why such vulnerabilities are artifacts of design rigidity.
Specifically, we show that the shoehorning of data
communications protocols onto a network rigorously op- 
timized for the delivery of voice causes that network to 
fail under modest loads. 


1 Introduction 


The interconnection of cellular networks and the Internet 
significantly expands the services available to telecom- 
munications subscribers. Once limited to basic voice ser- 
vices, these systems now offer data connections at the 
lower end of broadband speeds. Accordingly, devices 
attached to such networks are capable of engaging in 
applications ranging from traditional voice communica- 
tions to streaming video. While initial uptake of these 
services has been slow [1, 18], notable advances in con- 
nection speed and an expanded set of supported devices 


(e.g., laptops) are beginning to spur substantial accep- 
tance and usage. 


The transformation of these systems from isolated 
providers of telephony to Internet-attached general pur- 
pose communication networks has already been marred 
by concerns of inadequate security. As connections be- 
tween such systems and external data networks have 
developed, a number of researchers have noted weak- 
nesses in the telecommunications infrastructure. For ex- 
ample, our previous work on targeted text messaging at- 
tacks demonstrated the ability to deny service to large 
metropolitan areas with the bandwidth available to a sin- 
gle cable modem [16,47]. While these and a host of other 
exploits [39, 44] have explored the impact of specific at- 
tacks against cellular networks, they have all failed to an- 
swer a larger question: “How does the architecture of cel- 
lular data networks inherently make them susceptible to 
denial of service attacks?” Unexpectedly, the answer to 
this question has little to do with bandwidth constraints. 
Instead, these vulnerabilities are the result of the conflict 
caused by connecting two networks built on fundamen- 
tally opposing design philosophies. 

In this paper, we argue that low-bandwidth denial of 
service attacks in telecommunications networks are ar- 
tifacts of incompatibility caused by interconnecting sys- 
tems built with two differing sets of design requirements. 
While the merits of independent “smart” and “dumb” ar- 
chitectures have been widely debated, none have exam- 
ined the inherent security issues caused by the connec- 
tion of two mature systems built on these opposing de- 
sign tenets. To support our assertion, we present two 
new vulnerabilities in cellular data services. These at- 
tacks specifically exploit connection setup and teardown 
procedures in networks implementing the General Packet 
Radio Service (GPRS). Through a combination of anal- 
ysis and simulation, we characterize the impact of such 
attacks on legitimate voice and data services in the net- 
work. We then use these new attacks, in combination 
with previously discussed vulnerabilities, as demonstra-
ble evidence that the translation of traffic between these
two network architectures is the root of such problems. 
Through this, we seek to develop a larger sense for why 
such attacks are possible, even in the presence of a cellu- 
lar network with hypothetically infinite bandwidth. Ulti- 
mately, by understanding causality, the discovery of fu- 
ture vulnerabilities is vastly simplified. 

In so doing, we make the following contributions in 
this work: 


• New Vulnerability Analysis: We identify and de-
velop a realistic characterization of two new vulner-
abilities in cellular data networks. These exploits 
target specific components of the expensive connec- 
tion setup and teardown procedures and can prevent 
legitimate use of data services. While the partition- 
ing of voice and data flows in such networks is de- 
signed to protect each traffic type from the other, 
our attack on setup mechanisms demonstrates that 
optimizations made for efficiency can result in the 
disruption of voice services. 


• Implications of Combined Design Philosophies
on Security: We use the body of available vulnera- 
bilities as the basis for an analysis to determine the 
underlying cause of such denial of service attacks. 
Consequently, we show that these problems are not 
necessarily the result of poor protocol design but are 
instead deeply rooted in opposing architectural as- 
sumptions. 


The remainder of this paper is organized as follows: 
Section 2 offers a brief overview of our previous work on 
targeted SMS attacks to prime the reader with additional 
data points; Section 3 presents and offers an initial anal- 
ysis for our newly discovered vulnerabilities; Section 4 
uses monitoring of deployed cellular networks and sim- 
ulation to support the conclusions made in the previous 
section; Section 5 coalesces the previous attacks on cel- 
lular networks as data points in our larger argument; Sec- 
tion 6 offers a discussion of techniques to address such 
problems; Section 7 provides related work; Section 8 of- 
fers concluding thoughts. 


2 Prior Work - Text Messaging Attacks 


We present a high-level overview of our previous attacks 
on text messaging [16,47]. With some five billion mes- 
sages sent each month in the United States alone [28], 
this service has become one of the premier streams 
of revenue for cellular network operators. To encour- 
age widespread use, providers have opened a significant 
number of gateways between the Internet and their net- 
works. Whether through email, instant messaging appli- 
cations or even a provider’s website, it is possible to ex- 


change asynchronous communications with cellular sub- 
scribers. The ability to communicate across such net- 
works, however, is not without potential consequences. 

A cellular network¹ must perform multiple tasks be-
fore delivering a text message. The network first con-
ducts a series of lookups to determine the location of 
the destination device. The device must then be awo- 
ken from an energy-saving sleep state and authenticated. 
A connection can then be established and the incom- 
ing text message delivered. Critical to this process is 
the Standalone Dedicated Control Channel (SDCCH), 
which is responsible for the authentication and content 
delivery phases of text messaging. With a bandwidth 
of 762bps [6], this constrained channel is shared by the 
setup phases of both text messaging and voice calls. Con- 
sequently, by keeping the SDCCH saturated with text 
messages, incoming legitimate voice and text messages 
cannot be delivered by the network. Understanding this,
an adversary attempting to exploit this system can use 
web-scraping and feedback from provider websites to 
create “hit-lists” of targeted devices. By sending traf- 
fic to these targeted devices at a rate of approximately 
580 Kbps, the adversary would be able to deny service to
all of Manhattan. 

Attack mitigation techniques, ranging from queue 
management to resource allocation strategies on the air 
interface, were then shown to diminish much of the im- 
pact of such attacks. While successful, these counter- 
measures did not consider the use of cellular data ser- 
vices such as GPRS to alleviate targeted text messaging 
attacks. Logically, delivering data traffic over separate, 
higher bandwidth links should provide the most com- 
plete solution to this problem. However, as we show in 
the next section, it is possible to disrupt cellular data ser- 
vices with less bandwidth than was used in the original 
SMS attack. 


3 New Vulnerabilities in Cellular 
Data Services 


We present two new denial of service (DoS) vulnera- 
bilities in cellular data services. These attacks use a 
relatively small amount of traffic to exploit connection 
setup and teardown mechanisms. We use publicly avail- 
able specifications to provide an initial characterization 
of these attacks and as a means of demonstrating the 
potential for the interruption of data services in major 
metropolitan areas. 


3.1 Network Architecture 


Before a GPRS/EDGE² network provides any services
to a mobile device user, a series of attachment and au- 
thentication procedures must take place. On power-up,
Figure 1: A high-level network architecture for cellular data
networks. 


a device (e.g., mobile phone) transmits a GPRS-attach 
message to the network. The base station forwards this 
message to the attached Serving GPRS Support Node 
(SGSN), which authenticates the user’s identity with the 
help of the Home Location Register (HLR). The HLR 
supports both voice and data operations in the network 
by keeping track of information including user location, 
availability and accessible services. When this process 
completes, the mobile device has a virtual connection 
with the network. 

In order to exchange packets with external networks, 
the mobile device must then establish a Packet Data Pro- 
tocol (PDP) context with the network. The PDP context 
is a data structure stored in the SGSN and the Gateway 
GPRS Support Node (GGSN) and is responsible for map- 
ping billing information, quality of service requirements 
and an IP address to a user device. While many phones 
do not currently automatically establish a PDP context 
on power-up, the trend towards doing so (e.g., email- 
capable phones and GPRS-equipped laptops) is rapidly 
increasing. As cellular providers move into the broad- 
band Internet market, such numbers will continue to ex- 
pand rapidly. 

Having been authenticated and registered, a mobile 
device is capable of exchanging packets with hosts in- 
ternal and external to the cellular network. At some time 
after attachment, a packet originating from an Internet- 
based host and destined for a mobile device arrives at the 
GGSN. The GGSN compares the destination IP address 
to those of established PDP contexts and, upon finding 
the corresponding entry, forwards the packet to the cor- 
responding SGSN. The SGSN begins the process of con- 
nection establishment and wireless delivery. Figure 1 
highlights this network architecture. 

The final hop of packet delivery occurs over the air in- 
terface. The details of this step, however, depend upon 
the current state of the device. As power has tradition-
Figure 2: A state transition diagram for mobile devices, in-
cluding transition functions.


Figure 3: When the first packet of a session arrives at the base
station, the host must be paged and then assigned logical re-
sources. The messages and channels used to accomplish this
are shown above.

[Figure 3 message sequence: Packet Paging Request (PPCH), network to device; Packet Channel Response (PRACH), device to network; Packet Resource Assignment (PAGCH), network to device; Packet Paging Response (PACCH), device to network; Packet Data Transfer (PDTCH), network to device.]


ally been a concern in this setting, mobile devices are 
not constantly listening for incoming packets. To accom- 
modate this constraint, devices operate in one of three 
states: IDLE, STANDBY, and READY. Devices in the 
IDLE state are unregistered with the network and there- 
fore unreachable. In the power-saving STANDBY state, 
in which the vast majority of time is spent, devices pe- 
riodically listen for network “wake up” messages known 
as pages. Upon receiving a page from the network, the 
device transitions into the READY state. In this state, 
a device constantly monitors the air interface for incom- 
ing packets. When packets are not received for a number 
of seconds, devices transition back into the STANDBY 
state to conserve power. These three states and the tran- 
sitions between them are shown in Figure 2. 
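
The state behavior described above can be summarized as a small transition table. The following Python sketch is illustrative only: the event names (`attach`, `page`, `packet`, `timer_expired`, `detach`) are our own labels, and the transitions into and out of IDLE are assumptions, since the text spells out only the STANDBY and READY behavior.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"        # unregistered, unreachable
    STANDBY = "STANDBY"  # power-saving, listens periodically for pages
    READY = "READY"      # constantly monitors the air interface


# Event names are illustrative assumptions. The attach/detach rows are
# also assumed; the text only spells out page, packet and timer behavior.
TRANSITIONS = {
    (State.IDLE, "attach"): State.READY,
    (State.STANDBY, "page"): State.READY,
    (State.READY, "packet"): State.READY,
    (State.READY, "timer_expired"): State.STANDBY,
    (State.STANDBY, "detach"): State.IDLE,
    (State.READY, "detach"): State.IDLE,
}


def step(state, event):
    """Return the next state; events with no matching row leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.IDLE):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

A device that attaches and then receives no traffic ends in STANDBY under this model (`run(["attach", "timer_expired"])`), and only a page moves it back to READY.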


On the arrival of the first packet in a flow, the SGSN 
begins the process of locating the targeted device. If the 
destination device is not currently in the READY state, 
the base station nearest to the device is unknown to the 
network. Accordingly, the SGSN creates paging mes- 
sages to be sent from a number of base stations. Upon 
receiving a paging request, a base station transmits a 
message to multiple sectors (i.e., service areas) over the 
Packet Paging Channel (PPCH). Whether due to inter- 
ference or sleep cycles, the paging process typically re-
quires multiple iterations. If the targeted device is awake
and hears its temporary identifier in a paging message, it 
attempts to alert the network of its presence by respond- 
ing on the Packet Random Access Channel (PRACH). 
The base station receiving this response alerts the SGSN 
that the destination device has been located. The net- 
work then responds on the Packet Access Grant Chan- 
nel (PAGCH) with a message containing a list of Packet 
Data Traffic Channels (PDTCHs) that should be moni- 
tored for incoming data. The device acknowledges re- 
ceiving this message over the Packet Associated Control 
Channel (PACCH). At the end of this setup, as illustrated 
in Figure 3, the network can then route traffic directly to 
the READY state device. Note that the above channels 
are largely complementary to channels used for voice 
signaling (the naming convention, minus the “Packet” 
prefix, is the same). Because running two sets of con- 
trol channels leads to the underuse of limited spectrum, 
the standards documents indicate that it is acceptable for 
voice and data control channels to be shared [3,7]. 


3.2 Packet Multiplexing on the Air 
Interface 


Data services have been available from cellular networks 
for a number of years. Like voice telephony, these 
circuit-switched services required that a single endpoint 
monopolize a channel for the entire duration of its con- 
nection to the network. Regardless of whether this con- 
nection was used to constantly stream content or inter- 
mittently deliver packets, the provider charged the end 
user for the entire duration of the connection. Accord- 
ingly, demand for such inefficient services was not great. 
GPRS overcomes these limitations by multiplexing mul- 
tiple traffic flows over individual links. Accordingly, it 
is possible to serve a large number of users on a single 
physical channel concurrently and only charge them for 
the packets they exchange. 

GPRS provides data service by building on the times- 
lot structure of GSM. Specifically, a contiguous piece of 
radio spectrum is subdivided into equal timeslots. When 
assigned a timeslot, a user exerts temporary control over 
a small piece of the air interface. To provide the illusion 
of continuous control, sets of eight timeslots are grouped 
into a frame so that each can be serviced once every 
4.615ms. This sampling across timeslots creates physical 
channels, upon which voice, data and control traffic can 
be delivered. When used for data, these physical chan- 
nels are referred to as Packet Data Channels (PDCHs). 
Each set of 52 frames creates larger units known as mul- 
tiframes. These multiframes are subdivided into 12, four- 
timeslot blocks, with logical channels then mapped onto 
each block. The remaining four timeslots in a multiframe 
are used for time synchronization and signal strength
Figure 4: Each timeslot in a GPRS TDMA frame is used
to create physical channels called Packet Data Channels (PD- 
CHs). Every 52-frame time period creates a multiframe, which 
is divided into twelve bursts of four. Each group, or bursts, 
holds a single logical channel. The specific allocation of these 
channels is dependent on the network. The remaining timeslots 
are used for time synchronization and idle measurement. 


measurement periods. For example, in Figure 4, block
B0 may function as a PPCH and blocks B1, B4 and B7
may be used as PDTCHs³ [7].
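
As a quick check of the framing arithmetic above, the sketch below computes the duration of one multiframe and how the 52 timeslot occurrences on a single PDCH divide into blocks. The constant and function names are our own; the values are the ones quoted in the text.

```python
FRAME_MS = 4.615             # duration of one eight-timeslot TDMA frame
FRAMES_PER_MULTIFRAME = 52   # one PDCH timeslot occurs once per frame
BLOCKS_PER_MULTIFRAME = 12   # blocks B0 through B11
SLOTS_PER_BLOCK = 4


def multiframe_layout():
    """Return the duration and slot split of one multiframe on one PDCH."""
    block_slots = BLOCKS_PER_MULTIFRAME * SLOTS_PER_BLOCK
    return {
        "multiframe_ms": FRAMES_PER_MULTIFRAME * FRAME_MS,
        "block_slots": block_slots,
        "remaining_slots": FRAMES_PER_MULTIFRAME - block_slots,
    }


layout = multiframe_layout()
```

Under these values a multiframe lasts roughly 240 ms, 48 of its 52 slots carry logical channels, and the remaining four are left for synchronization and measurement.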

When the first packet in a flow arrives at a base sta- 
tion for a user in STANDBY mode, the paging method 
described above occurs. As part of connection establish- 
ment, the flow receives a unique MAC layer label known 
as the Temporary Flow Identifier (TFI). Every subse- 
quent packet belonging to the Temporary Block Flow 
(TBF) is marked with this TFI so that a targeted mo- 
bile device knows which packets to decode. When the 
base station has no more packets to send to the destina- 
tion mobile device, the TBF and its associated TFI expire 
and can be reused by other flows in the immediate area. 
Upon TBF expiration, the mobile device returns to the 
STANDBY state. 


3.3 Exploiting Teardown Mechanisms


Because the process of locating, paging and establishing 
a connection between the network and an end device is 
expensive, the immediate expiration of a TBF is imprac- 
tical. For example, minor variations in packet interar- 
rival times would force a system as described above to 
frequently relocate, repage and reestablish connectivity 
with users. Accordingly, networks implement a delayed 
teardown of resources. This means that devices remain 
in the READY state and retain their TBF for a number of 
seconds before the network attempts to reclaim its logical 
resources. When a packet is delivered to the user, the net-
work sets a timer⁴, which is reset to its default value on
the arrival of each additional packet. The standards rec- 
ommend a timer value of approximately five seconds [2]. 
Given that the connection establishment process requires 
roughly the same amount of time, such a value is entirely 
reasonable. 
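
The delayed teardown described above can be sketched as a pool of flow identifiers, each held by a timer that is reset whenever another packet arrives. This is a toy model under stated assumptions, not the behavior of any particular base station: the class name `TfiPool` and its `deliver` method are hypothetical, the default pool size of 32 corresponds to a 5-bit identifier, and the five-second hold is the recommended timer value quoted above.

```python
class TfiPool:
    """Toy model of one sector's flow identifiers with delayed teardown."""

    def __init__(self, size=32, hold_seconds=5.0):
        self.size = size
        self.hold = hold_seconds
        self.expiry = {}  # device id -> time at which its TBF is reclaimed

    def _reclaim(self, now):
        # Drop every flow whose timer has run out.
        self.expiry = {d: t for d, t in self.expiry.items() if t > now}

    def deliver(self, device, now):
        """Try to deliver one packet; return False if no identifier is free."""
        self._reclaim(now)
        if device not in self.expiry and len(self.expiry) >= self.size:
            return False
        self.expiry[device] = now + self.hold  # set or reset the timer
        return True
```

In this model, 32 single-packet flows to distinct devices fill the pool, and any new destination is refused until the timers expire, unless the same 32 flows are refreshed first.
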
Because TFIs are implemented as a 5-bit field, an ad-
versary capable of sending 32 messages to each sector 
in a metropolitan area can exhaust logical resources and 
temporarily prevent users from receiving traffic. Tar- 
geted devices would not need to be infected or controlled 
by the adversary; rather, hit-list generation techniques 
similar to those discussed in our previous work [16] 
could be used to locate hosts able to receive traffic. If 
this task can be repeated before the TBF timers expire, a 
denial of service attack becomes sustainable. In order 
to more explicitly characterize the bandwidth require- 
ments, we model such an attack on Manhattan using 
well-known parameters [35,48]. Given an area of 31.1
miles² and a sector coverage area of approximately 0.5
to 0.75 miles², Manhattan contains 55 sectors. Using
a READY timer of 5 seconds and 41-byte attack packets
(i.e., TCP/IP headers plus one byte), the delivery of le- 
gitimate data services in Manhattan could be prevented 
with the attack shown below: 


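\[
\text{Capacity} \approx 55\ \text{sectors} \times \frac{32\ \text{msgs}}{1\ \text{sector}} \times \frac{41\ \text{bytes}}{1\ \text{msg}} \times \frac{1}{5\ \text{sec}} \approx 110\ \text{Kbps}
\]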


The exhaustion of all hypothetical TBFs may not be 
necessary given current usage and deployed hardware. 
As the current demand for voice services far outpaces 
cellular data usage, only a small percentage of physi- 
cal channels in a sector are used as PDCHs. Because 
GPRS/EDGE are not extremely high bandwidth services, 
allowing 32 individual flows to be concurrently multi- 
plexed across a single PDCH would be detrimental to 
individual throughput. Accordingly, often only a subset 
of the 32 TBFs (4, 8 or 16 [26,33]) are usable. The max-
imum number of concurrent TBFs in a sector is there-
fore min(d * u, 32), where d is the number of down-
link PDCHs and u is the maximum number of users per
PDCH. While the number of PDCHs can be dynamically 
increased in response to rising demand for data services, 
networks typically hold unused channels to absorb spikes 
in voice calls. It is therefore unlikely that all 32 TBFs 
will be available at all times, if ever. A more realistic ap- 
proximation of the bandwidth required to deny access to 
data services is given by: 


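\[
\text{Capacity} \approx 55\ \text{sectors} \times \frac{4\text{--}16\ \text{msgs}}{1\ \text{sector}} \times \frac{41\ \text{bytes}}{1\ \text{msg}} \times \frac{1}{5\ \text{sec}} \approx 14.1\text{--}56.4\ \text{Kbps}
\]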
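
The capacity estimates above are simple products, so they can be reproduced directly. The sketch below is a minimal calculation, assuming 1 Kbps is taken as 1024 bits per second (an assumption on our part, chosen because it reproduces the 14.1 and 56.4 Kbps figures); the function names are illustrative.

```python
def concurrent_tbfs(downlink_pdchs, users_per_pdch, tfi_limit=32):
    """Maximum concurrent TBFs in a sector: min(d * u, 32)."""
    return min(downlink_pdchs * users_per_pdch, tfi_limit)


def attack_kbps(sectors, flows_per_sector, packet_bytes=41, timer_s=5.0):
    """Traffic needed to refresh every flow in every sector once per timer."""
    bits_per_second = sectors * flows_per_sector * packet_bytes * 8 / timer_s
    return bits_per_second / 1024  # assumes 1 Kbps = 1024 bps


low = attack_kbps(55, 4)    # about 14.1 Kbps
high = attack_kbps(55, 16)  # about 56.4 Kbps
full = attack_kbps(55, 32)  # about 113 Kbps, quoted as roughly 110 Kbps
```

The estimate scales linearly in the number of usable flows per sector, which is why limiting a sector to 4, 8 or 16 TBFs lowers the cost of the attack rather than raising it.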


The brute-force method of attacking a cellular data 
network in a metropolitan setting is simply to saturate 
all of the physical channels with traffic. Even at their 
greatest levels of provisioning, the fastest cellular data 
services are simply no match against traffic generated by 


Internet-based adversaries [39,45]. Such attacks, obvi- 
ous by the sheer volume of traffic created, would likely 
be noticed and mitigated at the gateways to the network. 
However, with knowledge of the interaction between dif- 
ferent network elements, it is possible for an adversary 
to launch a much smaller attack capable of achieving the 
same ends. A basic understanding of the packet delivery 
process provides the requisite information for realizing 
this attack. 

Given a theoretical maximum capacity of 171.2 Kbps 
per frequency and as many as 8 allocated frequencies per 
sector, an adversary attempting the brute-force saturation 
of such a system would instead need to generate the vol-
ume of traffic calculated as:


\[
\text{Capacity} \approx 55\ \text{sectors} \times \frac{8\ \text{frequencies}}{1\ \text{sector}} \times \frac{171.2\ \text{Kbps}}{1\ \text{frequency}} \approx 73.56\ \text{Mbps}
\]

By attacking the logical channels instead of the raw 
theoretical bandwidth, an adversary can reduce the 
amount of traffic needed to deny service to a metropoli- 
tan area by as much as three orders of magnitude. Note 
that networks implementing EDGE, which can provide 
three times the bandwidth of a GPRS system, would ex- 
perience the same consequences given the same volume 
of attack traffic. 
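
To make the comparison between the two attack strategies concrete, the following sketch recomputes the brute-force volume and the ratio between it and a logical-channel attack. The function names are illustrative, and the division by 1024 when converting to Mbps is an assumption that matches the 73.56 Mbps figure.

```python
def brute_force_kbps(sectors=55, freqs_per_sector=8, kbps_per_freq=171.2):
    """Traffic needed to saturate every physical channel in every sector."""
    return sectors * freqs_per_sector * kbps_per_freq


def reduction_factor(logical_attack_kbps):
    """How many times smaller the logical-channel attack is."""
    return brute_force_kbps() / logical_attack_kbps


brute_mbps = brute_force_kbps() / 1024   # about 73.56 Mbps
best_case = reduction_factor(14.1)       # a little over 5000x
worst_case = reduction_factor(110.0)     # a little under 700x
```

The ratio ranges from a few hundred to a few thousand depending on how many TBFs are usable per sector, which is consistent with a reduction of as much as three orders of magnitude.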


3.4 Exploiting Setup Procedures 


If connections to an end host must repeatedly be reestab- 
lished, the interarrival time between successive packets 
becomes exceedingly large. Delaying resource reclama- 
tion is therefore a necessary mechanism to ensure some 
semblance of continuous connectivity to the network. 
This latency, however, is not simply the result of the time 
required for a user to overhear an incoming paging re- 
quest. To better understand setup cost, we examine a 
network in which resource reclamation occurs immedi- 
ately after the last packet in a flow is received. 

Of particular interest to such an analysis is the per- 
formance of the common uplink channel, the PRACH. 
Because this channel is shared by all hosts attempting to 
establish connections with the network, the PRACH in- 
herently has the potential to be a system bottleneck. To 
minimize contention, access to the PRACH is mediated 
through the slotted-ALOHA protocol. Given a channel 
divided into timeslots of size t and time synchronization 
across hosts, end devices attempting to establish connec- 
tions transmit requests at the beginning of a timeslot. In 
so doing, the network reduces the amount of time during 
which collision can occur from 2t in the random access
case to t. While slotted-ALOHA offers a significant im- 
provement over random access, its throughput remains
Figure 5: Blocking of legitimate traffic for varying at-
tack traffic loads. Note that blocking only occurs on the 
PDTCH. These loads represent the entire attack bandwidth 
used across Manhattan. 


low. Given a traffic intensity of G messages per unit time,
the normalized throughput γ of slotted-ALOHA is:

\[
\gamma = G e^{-G}
\]

The maximum theoretical utilization of a channel imple-
menting slotted-ALOHA is 0.368. In reality, however,
this value is significantly lower. As the number of in- 
coming connection establishment requests increases, so 
too does the need for retransmission due to collision. The 
throughput of such a system therefore typically stabilizes 
at a point far below this optimum value. Given a large 
number of paging requests, potentially caused by the im- 
mediate reclamation of resources as described above, the 
throughput of this already constrained channel would be 
severely degraded. Accordingly, the rate at which re- 
sponses to connection establishment requests will pass 
through this channel is much lower than the available 
bandwidth. Because the behavior of the PRACH is 
highly unstable and affected by feedback (i.e., retrans-
missions due to collision), we leave the characterization
of specific traffic volumes necessary to cause blocking to 
the next section. 
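
The throughput ceiling discussed above follows directly from the standard ALOHA formulas. The sketch below evaluates the slotted case, G·e^(-G), alongside the unslotted random access case, G·e^(-2G), whose vulnerable period is 2t; the function names are our own, and the model ignores the retransmission feedback that the text identifies as a source of instability.

```python
import math


def slotted_aloha_throughput(g):
    """Normalized throughput of slotted ALOHA at offered load g."""
    return g * math.exp(-g)


def pure_aloha_throughput(g):
    """Normalized throughput of unslotted (pure) ALOHA at offered load g."""
    return g * math.exp(-2 * g)


peak_slotted = slotted_aloha_throughput(1.0)  # 1/e, about 0.368
peak_pure = pure_aloha_throughput(0.5)        # 1/(2e), about 0.184
overloaded = slotted_aloha_throughput(3.0)    # well below the peak
```

Pushing the offered load past G = 1 lowers throughput rather than raising it, which is the effect a burst of connection establishment requests would have on the PRACH.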


4 Attack Characterization 


In order to better characterize the observations made in 
the previous section, we extend the GSM simulator from 
our previous work [47] to include support for GPRS data 
services. The parameters of this simulator were set by 
information from a variety of sources. The means by 
which these parameters were chosen are discussed in the 
Appendix.
Figure 6: TFI utilization for a Manhattan-wide attack at
200 Kbps. Actual PDTCH utilization (not shown) is virtu-
ally zero because of infrequent arrivals for these established 
flows. 


4.1 Modeling Attacks on Teardown 
Mechanisms 


To demonstrate the exploitation of delayed resource tear- 
down, we simulate a GPRS network under varying traffic 
loads. Although the full complement of TBFs may not be 
available in all real deployments [26, 33], we conserva- 
tively allow for up to 32 concurrent flows. When in use, 
each TFI is held for exactly five seconds unless a new 
packet arrives. While it is possible for a single device to 
obtain multiple TFIs, we assume that all incoming flows 
for a given destination share a single TBF [4]. Because 
of observations made on deployed networks, both voice 
and data setup requests share a number of control chan- 
nels. We therefore replace data control channels with 
their voice equivalents (i.e., RACH instead of PRACH). 

Legitimate voice and data calls were modeled as Pois- 
son random processes and generated at rates of 50,000 
and 20,000 per hour, respectively, across Manhattan. The 
durations of these flows are also generated in a simi-
lar fashion with means of 120 and 10 seconds, respec-
tively. These values represent standard volumes and ex- 
hibit no blocking. Attack flows, each consisting of a sin- 
gle packet, are also modeled by a Poisson random pro- 
cess with rates ranging from 100-200 Kbps. Each run, of 
which there were 1000 iterations for each attack load, 
simulated an hour of time with attacks occupying the 
middle 30 minutes. 
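
For readers unfamiliar with this style of workload generation, the following is a minimal sketch of how Poisson arrivals with the stated hourly rates might be produced. It is not the simulator used in this work: the function names are hypothetical, and drawing flow durations from an exponential distribution with the stated means is an assumption.

```python
import random


def poisson_arrivals(rate_per_hour, horizon_s=3600.0, rng=None):
    """Arrival times of a Poisson process, built from exponential gaps."""
    rng = rng or random.Random(0)
    rate_per_s = rate_per_hour / 3600.0
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= horizon_s:
            return arrivals
        arrivals.append(t)


def flow_durations(count, mean_s, rng=None):
    """Assumed exponential durations with the given mean, in seconds."""
    rng = rng or random.Random(0)
    return [rng.expovariate(1.0 / mean_s) for _ in range(count)]


voice_calls = poisson_arrivals(50_000, rng=random.Random(1))
data_flows = poisson_arrivals(20_000, rng=random.Random(2))
voice_lengths = flow_durations(len(voice_calls), 120.0, random.Random(3))
```

Attack traffic can be generated the same way by converting a bit rate into single-packet flows per hour before calling `poisson_arrivals`.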

Figure 5 shows the blocking rates of legitimate traffic 
caused by an attack on the delayed teardown mechanism. 
At a rate of 160 Kbps or greater, the ability to use cellu-
lar data services within Manhattan is virtually nonexis-
tent. The amount of traffic required to execute such an
attack is slightly greater than the estimation of a perfect
Figure 7: Blocking caused when immediate resource recla-
mation is enforced on data sessions. Notice that because both 
voice and data flows use the RACH, increased data requests 
cause voice blocking. No blocking was observed on other 
channels. 


scenario in Section 3.3 due to the exponentially distributed interar-
rival times used to generate packets. However, because
this more realistically represents the nature of packet de- 
livery in a network given the presence of other traffic, it 
offers a more accurate characterization of the attack. In 
spite of having the potential to deliver large volumes of 
traffic once flows are established, these results demon- 
strate that use of cellular data services can in fact be de- 
nied with less bandwidth than was used in the targeted 
text messaging attacks [16, 47]. 

Figure 6 offers additional insight into the attack by 
providing the utilization profile for a number of channels. 
Most importantly, only the PDTCHs operate at capacity 
during the attack. This utilization represents the state of 
virtual resources, not channel bandwidth. None of the 
channels responsible for delivering voice, most critically 
the traffic channels (TCHs), are measurably affected by 
the increase in data traffic. Note that this is deliberate 
as cellular data services such as GPRS are designed to 
completely separate voice and data services. 


4.2 Modeling Attacks on Connection Setup 


To characterize the impact of frequent connection 
reestablishment on a cellular data network, we simulate 
a variety of traffic levels in the presence of immediate 
resource recovery. Specifically, when the base station 
no longer has packets to send for a particular flow, the 
targeted device returns to the STANDBY state. Except 
for delayed teardown procedures, all network settings 
and conditions including legitimate traffic volumes and 
interarrival patterns, remain the same. Attacks in this 
scenario, each of which occurs according to a Poisson
Figure 8: The impact of RACH congestion on voice calls.
Notice that during the attack phase, voice call blocking on
the RACH causes a significant underutilization of traffic
channels.


random distribution, range from 2200-4950 Kbps spread 
across all of Manhattan. As in our previous experiments, 
each attack traffic level was run for 1000 iterations. 


Figure 7 shows the blocking rates for legitimate traffic 
on a number of channels. Unlike the attack in the previ- 
ous section, in which PDTCH blocking occurred because 
of TBF exhaustion, no loss of packets was observed on 
the PDTCHs. In spite of this, the results of these sim- 
ulations confirm a more significant vulnerability - both 
voice and data flows experience blocking on the RACH. 
Although such networks strive to separate voice and data 
traffic, the dual use of control channels allows misbehav- 
ior in one realm to affect the other. Generating just over 
3 Mbps of traffic for the entire city of Manhattan, an ad- 
versary is capable of blocking nearly 65% of all traffic - 
voice and data. For a network in which a blocking prob- 
ability of 1% is typically viewed as unacceptable, such 
an attack represents a serious operational crisis. 


Figure 8 provides further information about the im-
pact of the 4950 Kbps attack on voice and data services.
The most notable consequence of this attack is observ- 
able in the nearly 80% decrease in TCH utilization. The 
near-zero utilization of PDTCHs offers an explanation for
the lack of blocking observed in the previous figure - the 
majority of legitimate traffic is being filtered out before 
it can ever be delivered by the PDTCHs. Accordingly, a 
network using the settings described above is subject to 
attacks capable of denying both voice and data services.
Figure 9: Given a connection establishment latency and the size of requests (in packets), we examine the impact of varying
bandwidth on system throughput. When the available bandwidth allows for the virtually instantaneous delivery of requests, system 
throughput plateaus. This result indicates that bandwidth is ultimately not the bottleneck in this system. (log-scale) 


5 The Meeting of Conflicting System 
Design Philosophies 


At first glance, the differences between each of the at- 
tacks on cellular networks appear stark. Targeted text 
messaging attacks fill and maintain a low-bandwidth 
control channel at capacity. Adversaries attacking cellu- 
lar data services exhaust virtual resources or take advan- 
tage of access protocol inefficiencies. In reality, all of 
these vulnerabilities are remnants of a conflict between 
the design philosophies of telecommunications and tra- 
ditional data networks. Specifically, they are the result 
of contrasting definitions of a flow and the role of net- 
works in establishing them. To make such a claim more 
concrete, we begin by demonstrating how a pair of seem- 
ingly adequate techniques for mitigating the above at- 
tacks fails to do so. 


The most obvious approach to addressing the data at- 
tacks described in Section 3 is to expand the range of 
possible TFI values. Unfortunately, as mentioned ear- 
lier, these limitations are necessary given the bandwidth 
available to GPRS/EDGE networks. The use of 32 (or
fewer) concurrent flows per sector is a requisite conces- 
sion for providing basic levels of connectivity between 
the network and end devices. In order for an increased 
pool of identifiers to have a meaningful effect, the band- 
width available to data services would also need to be 
significantly increased. This combination of approaches 
is actually implemented in 3G cellular networks such as 
UMTS [8]. However, even these networks suffer from 
the high cost of connection establishment (i.e., deliver- 
ing the first packet in a flow). 


A session establishment period lasting a few seconds 
represents only a small fraction of the total lifetime for a 
connection persisting for a number of minutes. Given 
the limited amount of spectrum allocated to cellular 
providers, such infrequently used channels predictably 
occupy as little space as possible to avoid wasting band- 
width. Because the duration of a packet flow may not 
provide sufficient time over which such an expense can 
be amortized, the minimal allocation of bandwidth to 
connection establishment may in fact create a system 
bottleneck. To capture the impact of additional band- 
width on connection setup, we offer a simple model of 
request throughput for a sector as follows: 


\[
\text{Throughput} = \frac{\#\,\text{Packets}}{\text{Setup Latency} + \dfrac{\#\,\text{Packets}}{\text{Bandwidth}}}
\]


If the expense associated with connection establishment 
was the result of inadequate resources, an increase in 
bandwidth should alleviate much of this cost. Such a 
scenario would be equivalent to increasing the size of 
the smallest link in a traditional data network to im- 
prove end-to-end throughput. However, the calculated 
effects of increased bandwidth on overall throughput are 
extremely limited in this setting. Because connection 
establishment exchanges contain fixed-length messages 
and not the variably sized packets of data delivery, the 
presence of additional bandwidth does little to improve 
performance after each channel can send paging requests 
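instantaneously. As is shown in Figure 9, the limit of 
system throughput as bandwidth approaches infinity becomes: 

$$\lim_{\text{Bandwidth} \to \infty} \text{Throughput} = \frac{\#\,\text{Packets}}{\text{Setup Latency}}$$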

Figure 10: Increasing the number of channels can improve overall system throughput. However, individual throughputs and 
connection setup times react inversely. Reducing the expense of connection establishment must therefore come from a reduction in 
connection setup latency. (log-scale) 

Increasing system throughput can, for this reason, be accomplished 
in one of two ways. In the first, the number 
of channels over which connections can be sent could 
be increased. Such a change would allow many more 
connection establishment requests to be sent in paral- 
lel. While increasing the throughput of the system as 
a whole, this approach would prove detrimental to in- 
dividual users. As shown in Figure 10, subdividing a 
fixed bandwidth into additional channels intuitively re- 
duces the throughput of a single user. Adding extra chan- 
nels could also potentially create elevated contention for 
the shared uplink channel (RACH). More importantly, 
increasing the throughput of the system does not neces- 
sarily reduce cost with respect to delay experienced by 
individual users. Therefore, 


Decreasing the cost of connection establish- 
ment in a cellular data network is not a matter 
of increasing bandwidth but rather the reduc- 
tion of connection setup latency. 
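
The throughput model above can be explored numerically. The following is a minimal Python sketch, not code from this work. It assumes bandwidth is expressed in packets per second, so that the ratio of packets to bandwidth is a transmission time in seconds. The function names and the numeric values (one packet per request, a five-second setup latency) are illustrative assumptions rather than measured parameters.

```python
def throughput(num_packets: float, setup_latency_s: float, bandwidth_pps: float) -> float:
    """Packets delivered per second for one request under the simple model:
    throughput = packets / (setup latency + packets / bandwidth)."""
    return num_packets / (setup_latency_s + num_packets / bandwidth_pps)


def throughput_limit(num_packets: float, setup_latency_s: float) -> float:
    """Value the model approaches as bandwidth grows without bound."""
    return num_packets / setup_latency_s


if __name__ == "__main__":
    packets, latency = 1.0, 5.0  # hypothetical values chosen for illustration
    for bandwidth in (1.0, 10.0, 100.0, 1000.0):
        rate = throughput(packets, latency, bandwidth)
        print(f"bandwidth={bandwidth:7.1f} pkt/s -> throughput={rate:.4f} pkt/s")
    print(f"limit as bandwidth grows: {throughput_limit(packets, latency):.4f} pkt/s")
    # Halving the setup latency helps more than any amount of added bandwidth.
    print(f"half the latency at 1 pkt/s: {throughput(packets, latency / 2, 1.0):.4f} pkt/s")
```

Under these assumed values, adding bandwidth can never push the result past the setup-latency bound, whereas reducing the latency term raises the bound itself.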


The concept of connection establishment is consider- 
ably different in cellular and traditional data networks. 
In the case of the former, the network must page, wake, 
and negotiate with a targeted device before ultimately delivering 
traffic. Whether due to misaligned sleep cycles, missed paging 
messages or congestion, this set of operations can require 
more than five seconds before data can be transmitted. 
As discussed in Section 3, these 
concessions are made because the network assumes that 
end devices are limited both in terms of power and com- 
putational ability. True packet-switched networks pro- 
vide no such services; rather, higher layers in the pro- 
tocol stack implement functionality as needed. In gen- 
eral, each packet is treated as an individual entity and is 
simply forwarded to the next logical hop. Whether it is 
wired or wireless in nature, there is no connection to be 
established from the perspective of the network⁵. Nodes 
responsible for routing packets do not assume that their 
next hop neighbors have any specific abilities other than 
moving the packet closer to its intended destination. Ac- 
cordingly, connection setup latency is more accurately 
depicted as propagation delay from the viewpoint of 
these networks. Given that propagation delay and connection 
establishment latency differ by many orders of magnitude, the 
underlying cause of low-bandwidth attacks on cellular data 
networks becomes clearer. 


The vulnerable components in both the targeted text 
messaging and cellular data service attacks are those 
mechanisms responsible for translating traffic from one 
network architecture to another. While a data network 
simply forwards individual packets as they arrive, a cellular 
data network interprets the first packet in a flow as 
an indicator of more traffic to come. Rather than simply 
forward that packet to its final destination, the network 
dedicates significant processing and bandwidth resources 
to ensure that the end device is ready to receive 
data. This assumption is valid in traditional telephony 
because of the nature of voice communication. Except 
for cases of an immediate hangup, sessions are guaranteed 
to contain multiple “packets” of information. Data 
communications, however, do not necessarily share this 
characteristic. Any protocol or application generating 
packets separated by a number of seconds (e.g., instant 
messaging programs, session keep-alive messages, applications 
implementing Nagle’s algorithm [34]) violates 
this model. Whether it is embodied by text messages or 
data traffic, the amplification of a single incoming packet 
into a series of expensive, delay-inducing setup operations 
is the source of such attacks. Figure 11 reinforces this 
conclusion by comparing generalizations of the two architectures. 

Figure 11: A comparison of the cost of delivering a single packet in cellular and traditional data networks. In the cellular data case 
(left), a significant amount of delay is added because of connection establishment procedures, whereas the router in the traditional 
setting (right) simply forwards the packet to the final hop. 

Connection establishment in cellular and traditional 
networks is so different because the philosophies upon 
which these systems are based are incompatible. The notion 
that the middle of a network should provide only a limited 
set of simple functions is at the core of the end-to-end 
principle [42]. By making no assumptions about the con- 
text in which a packet’s contents will be used, the net- 
work is free to specialize in a single task - moving data. 
Services not used by all applications, including reliable 
delivery, content confidentiality and in-order arrival, be- 
come the responsibility of higher layers of the protocol 
stack in the end hosts. The concentration on sending 
packets allows networks built according to the end-to- 
end principle to be flexible enough to support new appli- 
cation types and usage models as they emerge. Telecom- 
munications networks are built on the opposite model. 
Hard service requirements, especially for real-time in- 
teraction, forced the network to provide the majority of 
service guarantees. Because the functionality of the net- 
work was once limited to voice applications, telecommu- 
nications systems could be tightly tailored to a specific 
set of constraints. The inclination to build a network in 
such a manner was addressed by the original end-to-end 
argument: 


“Because the communications subsystem is 
frequently specified before the applications 
that use the subsystem are known, the designer 
may be tempted to “help” the users by taking 
on more function than necessary.” [42] 


Because these specialized networks implement more 
functionality than is absolutely necessary, they exhibit 
rigidity, or the inability to adapt to meet changing re- 
quirements or usage [15]. Rigidity in design causes such 
systems to enforce assumptions appropriate for one sub- 
set of traffic on all others. The treatment of each packet 
as part of a larger flow is one embodiment of such in- 
flexibility. This rigidity is also apparent when examined 
from the perspective of evolving end devices. For ex- 
ample, many laptops now contain hardware supplying 
access to cellular data networks [21,37]. Regardless of 
their ability to implement services at higher layers of the 
protocol stack or their access to power, these end devices 
are forced to transition between STANDBY and READY 
states simply because such behavior is mandated by the 
network. Devices connecting via 802.11 could simply 
trade off the overhead associated with paging at the cost 
of additional power use. This point is made more obvi- 
ous when put in the context of home or office LANs sup- 
ported by a cellular backhaul connection. The network 
would require such systems to participate in the process 
of location determination and connection establishment 
in spite of their lack of mobility. By building assump- 
tions and services into the network itself, the system as 
a whole is made less flexible. When conditions change 
and assumptions fail to hold, the rigidity of cellular data 
systems causes them to break. 


6 Constructing Robust Cellular Data 
Networks 


Addressing the specific attacks detailed in this paper may 
be realistic in the short term. Optimized paging tech- 
niques [9,25] may help to reduce search time and its re- 
sulting delay. As was done with the SMS attacks [47], 
techniques from queue and resource management could 
be used to mitigate blocking on the RACH. The move to 
3G and a significantly larger pool of identifiers would reduce 
the practical likelihood of virtual resource exhaustion. 
While such methods would indeed mitigate many 
of the example vulnerabilities discussed in this work, a 
strategy for building robust cellular data systems based 
on constant patching would ultimately fail. All of the 
above solutions merely treat the symptoms of a larger 
problem. Accordingly, as long as there is a disconnect 
between the ways in which data is delivered in cellu- 
lar and traditional data systems, exploitable mechanisms 
will exist. Such mechanisms need not be limited to the 
wireless portion of the network; rather, any component of 
the core network involved in establishing a session will 
be vulnerable. 


The larger issue discussed in this paper, that of 
vulnerability caused by the exchange of traffic across 
two incompatible networks, will not be easily solved. 
Genuinely addressing this problem will require notable 
changes to the interaction between cellular data networks 
and end devices. One such technique might require 
a significant increase of location awareness on the side 
of the network. Between the generation of paging lists 
and bandwidth used in multiple sectors, significant pro- 
cessing resources and time are spent finding a device 
each time a connection establishment occurs. Instead of 
knowing that a device is serviced by a potentially large 
set of base stations, an improved system might require 
location update information from a device each time it 
moves between sectors. Used in concert with much 
shorter sleep cycles, such an improvement to location 
knowledge may make the elimination of paging possible. 
This approach, however, would have a serious impact on 
resources in both end devices and the network. From 
the user perspective, increased monitoring and interac- 
tion with the network would negatively impact battery 
life. On the network side, the overhead needed to process 
such an increase in messaging would also affect network 
performance. A more radical approach would be to 
replace cellular data services with a new high-bandwidth 
wireless protocol. Instead of necessarily sharing band- 
width and timeslotting schemes with voice communica- 
tions, this new protocol would be assigned to a separate 
portion of the spectrum. In so doing, designers of the 
new data system would not be constrained by any of the 
rigidity forced upon current cellular data networks. In 
addition to technical tradeoffs, this solution would also 
need to deal with the complexities involved in spectrum 
allocation, reducing its viability for the foreseeable future. 


These solutions are not an endorsement of any tech- 
nology or architecture over another. Instead, they are 
simply the product of an observation of the impact on 
availability caused by interconnecting diametrically op- 
posed methods of system design. Being beholden to a 
specific architecture and failing to understand the problems 
caused by linking such networks are in fact the 
causes of the rigidity seen in this system. It is highly 
unlikely that similar thinking will correct the problem. 


7 Related Work 


Representing perhaps the oldest functioning digital sys- 
tems, telecommunications networks have evolved signif- 
icantly since their inception over 100 years ago. While 
the nature of these systems themselves has transformed 
from manually configured and static to automated and 
mobile, many consumer behaviors have remained largely 
unchanged. Specifically, the frequency and duration 
of user calls have become largely predictable behav- 
iors. System designers have used these anticipated con- 
ditions to optimize resource allocation throughout their 
networks. The degree to which telecommunications net- 
works are tailored to such behavior quickly becomes ob- 
vious in the presence of unexpected changes to network 
usage. For example, the explosion in use of dial-up 
modems in the early 1990s caused widespread conges- 
tion because users were remaining connected for longer 
than expected time periods. Temporary fluctuations or 
surges, such as those seen minutes after the attacks on 
September 11th 2001, often render telecommunications 
networks unusable [35]. Such systems do not gracefully 
degrade under increased traffic volumes; rather, they often 
cease to provide service to the vast majority of subscribers. 

Recognizing this, our previous work focused on 
the ability to recreate the consequences of such high- 
traffic denial of service events through the use of low- 
bandwidth attacks. Using targeted loads of text mes- 
sages, we were able to demonstrate the ability to deny 
voice and SMS service to major metropolitan areas with 
the bandwidth available to a cable modem [16]. We later 
characterized these attacks through simulation and mea- 
surement and discussed the tradeoffs inherent to a num- 
ber of mitigation strategies [47]. Serror et al. [44] of- 
fered additional insight by exploring attacks on call pag- 
ing channels. Ricciato [39] provided a general discus- 
sion of the potential to flood data channels in next gen- 
eration networks with traffic generated by Internet-based 
pathogens. Racic [36] and Mulliner [32] then examined 
attacks on MMS. While by no means the only methods 
of causing service outages, these attacks are the first to 
address the potential for denial of service made possible 
by the connection between cellular networks and the In- 
ternet. 

Denial of service attacks have been studied in a va- 
riety of other contexts. Websites ranging from DNS 
roots [17], search engines [40] and software vendors [19] 
to online casinos [10] and news services [41] have all 
been temporarily disabled by overwhelming volumes of 
traffic. Real-world processes and resources connected 
to the Internet, including banking networks, emergency 
services [30] and even postal delivery [13] have also 
been subjected to such attacks. In response, significant 
work has been undertaken to classify [29] and alleviate 
[22-24, 43, 46, 49-52] such problems. Unfortunately, 
none of these solutions have been widely deployed. 

The debate over which network architecture is more 
resilient against such problems has raged for nearly 30 
years. Advocates of the “smart” network, which is em- 
bodied by centralized control and decision-making, ar- 
gue that this architecture provides the ability to prevent 
such overloading from occurring [31]. Supporters of 
“dumb” network architectures, which are built around 
the end-to-end principle [11, 12, 38, 42], contend that 
placing such control in the network itself dampens the 
ability to perform its intended task - routing packets. 
While both approaches have their tradeoffs, the discus- 
sion of the consequences of connecting systems that deal 
with transferring information in fundamentally different 
ways has not been addressed from the perspective of se- 
curity. 


8 Conclusion 


Efforts to address recently discovered vulnerabilities in 
cellular networks have focused on treating symptoms in- 
stead of the disease. Attempts to solve individual ex- 
ploits have been largely ad-hoc and, in their efforts to 
mitigate specific problems, create significant additional 
complexity and vulnerabilities in these systems. With- 
out an understanding of why such attacks are happening, 
this cycle of vulnerability discovery and patching will 
continue indefinitely. The problems presented in this and 
other papers are artifacts of a larger architectural mis- 
match. Specifically, in spite of a concerted effort to sup- 
port packet-switched traffic, cellular data networks are 
still, at their essence, circuit-switched systems. Because 
of this inflexibility, any mechanism responsible for con- 
nection establishment in these networks is vulnerable to 
a low-bandwidth denial of service attack. 

We arrive at this conclusion by making the following 
contributions: 


• Although conventional wisdom suggests that increased 
bandwidth provides robustness against such 
attacks, we use two new vulnerabilities to demonstrate 
that low-bandwidth denial of service attacks 
can prevent legitimate access to cellular data services. 
In so doing... 

• ...we demonstrate that a mismatch of bandwidth 
between cellular data networks and the Internet is 
not the cause of such attacks. Instead, they are the 
result of the contrasting ways in which “smart” and 
“dumb” networks treat flows. From this... 

• ...we show that in their uniform treatment of all 
flows, regardless of size or duration, cellular data 
networks exhibit design rigidity. By building significant 
assumptions about the behavior of traffic into 
the network itself, such systems are made brittle in 
the face of changing conditions. 


These issues can therefore be addressed in one of two ways. 
In the first, methods of safely translating traffic between 
packet- and circuit-switched networks could be developed. 
Alternatively, such networks 
could be redesigned to truly support packet-switched 
mechanisms. By genuinely separating voice and data, 
not only in the spectrum they occupy but also in the tech- 
niques through which they are delivered, robust cellular 
data networks could be constructed. In the absence of 
such changes, cellular networks will remain 
vulnerable to low-bandwidth exploits. 
Notes 


¹We use the GSM architecture to provide specific details in our 
explanation. Similar mechanisms exist in other cellular networks. 

²Enhanced Data rates for GSM Evolution (EDGE) is largely equivalent 
to GPRS. The most significant difference is the use of a new 
wireless modulation technique known as 8-phase shift keying (8PSK), 
which allows higher data rates. 

³Note the subtle difference in naming. PDTCHs are virtual channels 
that are run on top of physical PDCHs. 

⁴This timer is referred to in the specifications as T3169 [2]. It is 
actually started when the counter N3101, which indicates the number of 
radio blocks that have passed since the last exchange with the targeted 
device occurred, reaches its maximum value. Our description above is 
meant to simplify the exact mechanisms for the reader without loss of 
precision. 

⁵We consider connection establishment in terms of individual flows. 
Initial access to almost every network has a cost (authentication, etc.). 
This startup cost, however, is amortized in both settings. 

⁶At the time of this writing, Cingular Wireless had not yet been 
renamed AT&T. 

⁷The voice network equivalent of the PRACH is employed due to 
the observed presence of dual-use control channels. 
Appendix 


Simulator Design 


We extend the GSM simulator built in our previous 
work [47] to provide support for GPRS data service. In 
total, the project contains nearly 10,000 lines of code 
(an addition of approximately 2,000 lines) and support- 
ing scripts. A high-level overview of the components 
is shown in Figure 12, where solid and broken lines in- 
dicate message and reporting flows, respectively. Traf- 
fic is created according to a Poisson random distribution 
through a Mersenne Twister Pseudo Random Number 
Generator [20], saved to a file and then loaded at run- 
time. The path taken by individual requests depends on 
the flow type. We focus on the data path, as the behavior 
of SMS and voice messages was explained in the previous 
iteration of the simulator. 
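
The traffic generation step can be illustrated with a short Python sketch. It is not the simulator's code. It relies on the documented fact that Python's `random` module uses the Mersenne Twister as its core generator, and on the standard result that a Poisson arrival process has exponentially distributed gaps between arrivals. The function names, the one-timestamp-per-line file format and the rate used in the demonstration are assumptions made for illustration.

```python
import random
import tempfile
from pathlib import Path


def generate_arrivals(rate_per_s: float, duration_s: float, seed: int) -> list:
    """Arrival times of a Poisson process: exponential gaps with mean 1/rate."""
    rng = random.Random(seed)  # Mersenne Twister underneath, seeded for repeatability
    arrivals, now = [], 0.0
    while True:
        now += rng.expovariate(rate_per_s)
        if now >= duration_s:
            return arrivals
        arrivals.append(now)


def save_trace(arrivals: list, path: Path) -> None:
    """Write one timestamp per line; repr() keeps full float precision."""
    path.write_text("\n".join(repr(t) for t in arrivals))


def load_trace(path: Path) -> list:
    """Read a trace written by save_trace back into a list of floats."""
    return [float(line) for line in path.read_text().splitlines() if line]


if __name__ == "__main__":
    trace = generate_arrivals(rate_per_s=10.0, duration_s=60.0, seed=1)
    with tempfile.TemporaryDirectory() as tmp:
        trace_file = Path(tmp) / "arrivals.txt"  # hypothetical file name
        save_trace(trace, trace_file)
        assert load_trace(trace_file) == trace
    print(f"{len(trace)} arrivals generated for 60 simulated seconds")
```

Generating the trace once and replaying it from a file, as the text describes, lets different configurations be compared against identical input.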

Figure 13: A Samsung Blackjack (SGH-i607) running in Field 
Test Mode provides operational data on the associated cellular 
network including channel configuration (shown here) and signal 
strength. 

If the network has not already dedicated resources 
to a flow when a packet arrives, the packet is passed to the 
RACH module. This random access channel is implemented 
in strict accordance with 3GPP TS 04.18 [5] and 
is tunable via max_retrans and tx_integer values. 
Messages completing processing in the RACH are then 
delivered to the Service Queue Manager module, which 
in turn redirects data packets to the PDCH module. If 
a TFI is available, the packet is assigned the virtual re- 
source, timers are set to five seconds and the packet is 
then delivered according to a FIFO ordering. The arrival 
of additional packets in a flow resets the timers to their 
default values to maintain resource control. When timers 
expire, the network reclaims a TFI for use in the delivery 
of other flows. Packets arriving at the Message Gener- 
ation Manager as part of an active flow bypass the con- 
nection setup phases of the network and move directly to 
the PDCH module. 
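
The TFI handling described above can be sketched as a small pool of identifiers with per-flow timers. This is an illustrative Python model under stated assumptions, not the simulator's PDCH module: the class name `TfiPool`, the string results and the lazy reclaiming of expired identifiers are all hypothetical choices. The pool size of 32 and the five-second timer come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class TfiPool:
    capacity: int = 32      # concurrent flows per sector, as stated in the text
    hold_s: float = 5.0     # timer value used in the simulator description
    expiry: dict = field(default_factory=dict)  # flow id -> time its TFI is reclaimed

    def _reclaim(self, now: float) -> None:
        """Release every TFI whose timer has expired."""
        for flow in [f for f, t in self.expiry.items() if t <= now]:
            del self.expiry[flow]

    def on_packet(self, flow: str, now: float) -> str:
        """Classify a packet arriving at time `now` for the given flow."""
        self._reclaim(now)
        if flow in self.expiry:
            # Active flow: reset the timer and bypass connection setup.
            self.expiry[flow] = now + self.hold_s
            return "active"
        if len(self.expiry) < self.capacity:
            # New flow: assign a TFI and start its timer.
            self.expiry[flow] = now + self.hold_s
            return "setup"
        return "blocked"    # no TFI is free


if __name__ == "__main__":
    pool = TfiPool(capacity=2)
    for flow, t in [("a", 0.0), ("b", 1.0), ("c", 2.0), ("a", 4.0), ("c", 6.5)]:
        print(f"t={t:4.1f}s flow={flow}: {pool.on_packet(flow, t)}")
```

In this model a flow that sends one packet just inside every timer period holds its identifier indefinitely, which is the behavior that makes the pool a scarce resource.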

The accuracy of simulation was measured in two 
ways. The components used by voice and SMS were 
previously verified using a comparison of baseline sim- 
ulation against calculated blocking and utilization rates. 
With 95% confidence, values fell within ±0.006 (on a 
scale of 0.0 to 1.0) of the mean. The simple nature of 
the PDCH module allowed verification of correctness 
through baseline simulations and observation. 
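
A confidence bound of this kind can be computed from repeated simulation runs. The sketch below is an assumption about method, since the text does not say how the interval was derived: it uses the normal approximation, in which the half-width of a 95% interval for the mean is 1.96 times the standard error. The function name and the sample values are hypothetical.

```python
import math
import statistics


def ci95_half_width(samples) -> float:
    """Half-width of a 95% confidence interval for the mean,
    using the normal approximation (1.96 standard errors)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return 1.96 * standard_error


if __name__ == "__main__":
    # Hypothetical utilization values from five baseline runs (scale 0.0 to 1.0).
    runs = [0.50, 0.52, 0.48, 0.51, 0.49]
    mean = statistics.mean(runs)
    print(f"mean={mean:.3f}, 95% half-width={ci95_half_width(runs):.4f}")
```

With few runs, a Student's t multiplier would give a wider and more conservative interval than the 1.96 used here.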





Parameter Setting 


When possible, we use settings found in currently de- 
ployed cellular data networks. However, such values are 
largely unpublished or unavailable to the general pop- 
ulation. To find this information, we ran a Samsung 
Blackjack (SGH-i607) attached to the Cingular Wireless 
network⁶ [14] in Field Test Mode. This mode of 
operation effectively turns a phone from a communica- 
tions device to a network auditing platform. In addition 
to reporting the identification and signal strength readings 
of nearby base stations, Field Test Mode provides 
network deployment information including channel al- 
location and layout. Accordingly, use of this mode of 
operation is typically restricted; however, access codes 
and device firmware upgrades are readily available on- 
line. As is shown in Figure 13 and of particular inter- 
est to properly modeling the behavior of real networks, 
the field PBCCH Present FALSE indicates that voice 
and data control traffic use the same channels. This con- 
figuration, as previously discussed, is permitted by the 
standards [7] and effectively minimizes the amount of 
spectrum reserved for control information. Such a setting 
is believed to be common across the majority of provider 
networks. From these observations, the establishment of 
voice and data connections occurs over shared control 
channels in our simulations. 

Other parameters are set using additional literature. 
For example, the RACH⁷ is optimally set to reduce the 
probability of request blocking by allowing up to the 
maximum of seven retransmissions per request by the 
base station [27]. 
Abstract 


The growing popularity of wireless networks and mo- 
bile devices is starting to attract unwanted attention 
especially as potential targets for malicious activities 
reach critical mass. In this study, we try to quantify 
the threat from large-scale distributed attacks on wireless 
networks, and, more specifically, wifi networks in densely 
populated metropolitan areas. We focus on three likely 
attack scenarios: “wildfire” worms that can spread con- 
tagiously over and across wireless LANs, coordinated 
citywide phishing campaigns based on wireless spoofing, 
and rogue systems for compromising location privacy in 
a coordinated fashion. The first attack illustrates how 
dense wifi deployment may provide opportunities for at- 
tackers who want to quickly compromise large numbers 
of machines. The last two attacks illustrate how botnets 
can amplify wifi vulnerabilities, and how botnet power is 
amplified by wireless connectivity. 


To quantify these threats, we rely on real-world data 
extracted from wifi maps of large metropolitan areas in 
the United States and Singapore. Our results suggest that a carefully 
crafted wireless worm can infect up to 80% of all 
wifi connected hosts in some metropolitan areas within 
20 minutes, and that an attacker can launch phishing at- 
tacks or build a tracking system to monitor the location 
of 10-50% of wireless users in these metropolitan areas 
with just 1,000 zombies under his control. 
1 Introduction 


The last two decades of network security research have 
demonstrated that attackers are continuously evolving, 
exploring creative ways to exploit systems, and targeting 
new technologies and services as they emerge. Indeed, 
the widespread use of email brought spam and email- 
viruses; broadband connectivity was followed by the rise 
of rapid self-propagating worms; while the growing use 
of online personal services and electronic commerce re- 
sulted in sophisticated personal data theft attacks, includ- 
ing phishing. Such trends suggest that any technology 
that reaches some kind of critical mass will attract the 
attention of attackers. 


At the same time, modern attacks such as worms, 
spam, and phishing exploit gaps in traditional threat 
models that usually revolve around preventing unau- 
thorized access and information disclosure. The new 
threat landscape requires security researchers to consider 
a wider range of attacks: opportunistic attacks in addi- 
tion to targeted ones; attacks coming not just from ma- 
licious users, but also from subverted (yet otherwise be- 
nign) hosts; coordinated/distributed attacks in addition 
to isolated, single-source methods; and attacks blending 
flaws across layers, rather than exploiting a single vul- 
nerability. Some of the largest security lapses in the last 
decade are due to designers ignoring the complexity of 
the threat landscape. 


The increasing penetration of wireless networking, 
and more specifically wifi, may soon reach critical mass, 
making it necessary to examine whether the current state 
of wireless security is adequate for fending off likely 
attacks. This paper discusses three types of threats 
that seem insufficiently addressed by existing technology 
and deployment techniques. The first threat is wildfire 
worms, a class of worms that spreads contagiously 
between hosts on neighboring APs. We show that such 
worms can spread to a large fraction of hosts in a dense 
urban setting, and that the propagation speed can be such 
that most existing defenses cannot react in a timely fash- 
ion. Worse, such worms can penetrate through networks 
protected by WEP and other security mechanisms. The 
second threat we discuss is large-scale spoofing attacks 
that can be used for massive phishing and spam cam- 
paigns. We show how an attacker can easily use a bot- 
net by acquiring access to wifi-capable zombie hosts, and 
can use these zombies to target not just the local wireless 
LAN, but any LAN within range, greatly increasing his 
reach across heterogeneous networks. Last but not least, 
we discuss the use of Tracknets, city-wide wifi botnets 
for unauthorized tracking of user location and behavior. 


All three types of attacks, illustrated in Figure 1, 
are specific to wireless networks, and are based on the 
premise of dense wifi network deployment in urban set- 
tings. While most of the underlying vulnerabilities have 
been widely known for years, the amplifying power of 
densely deployed wifi networking has a profound impact 
on both the feasibility and the magnitude of the threats, 
suggesting that their importance may have been grossly 
underestimated. For instance, the susceptibility of open 
wireless LANs to spoofing has been well documented, 
but the need to be in physical proximity to the victim 
may have deterred the wider use of this attack so far. The 
ability to launch such attacks remotely is much more at- 
tractive, and can be scaled up by the use of coordination 
and a botnet infrastructure. 


As a result of underestimating these threats, no coun- 
termeasures are currently implemented. The mecha- 
nisms needed to thwart these attacks are in most cases 
either available but not actively used, or not available but 
relatively easy to implement. The fact that such mecha- 
nisms are not used is of particular concern. For instance, 
802.11i security mechanisms have been available for sev- 
eral years, and would address a large part of the problems 
described, but unfortunately they are currently not used 
by enough users. Similarly, the encryption of MAC 
addresses would significantly increase the work-factor 
for Tracknets, but leaving the MAC addresses exposed 
was not deemed as a serious enough problem by the 
802.11i group. Related to the worm problem, content-based 
filtering is widely used and intrusion prevention 
is a mature technology, yet to the best of our knowledge, 
it has not been employed in access point wireless-to-wireless 
forwarding. Raising awareness of the threats, 
using convincing, experimental evidence, is therefore at 
least as important as exploring and implementing possi- 
ble defenses. 


The main focus of this paper is in quantifying these 
threats, specifically in metro-area wireless networks. We 
rely primarily on publicly available maps of wireless ac- 
cess point locations, also known as wardriving maps, and 
attempt to derive estimates of the feasibility and effectiveness 
of the attacks using measurements and simulations. 
These estimates paint a grim picture of the exposure 
of current wireless networks to such attacks, and 
indicate that the risks are further increased as wireless 
penetration continues to grow as predicted. 

We also explore possible remediation strategies, most 
of which we have implemented and tested experimen- 
tally. In some cases, the defenses we have considered 
are just a matter of engineering, such as retrofitting re- 
active worm defense hooks and filtering capabilities in 
wifi gear. In other cases, countering the threat required 
novel techniques, such as those for detecting and preventing 
different variants of the basic spoofing attack; 
several such variants were discovered while considering 
possible defenses and how attackers might try to 
circumvent them. 

While some of these techniques would become redundant 
if 802.11i is widely deployed, we cannot rest on the 
assumption that such deployment will happen anytime 
soon, particularly in light of usability concerns. For ex- 
ample, none of the recently announced municipal wire- 
less initiatives that we are aware of employ any form of 
protection, most likely due to the current perception of 
the risks of open wireless as well as the cost of managing 
accounts and passwords for a large number of users (in 
one instance, 100,000 users and around 9M annual visitors). 
Furthermore, the choice of running an open wireless 
network may not always be a matter of ignorance or complacency, 
but a conscious choice; for example, to provide 
network access to guests, backup connectivity to neighbors, 
etc. [26]. Whether temporary or long-term, we believe 
that our supplementary defense techniques are useful 
for mitigating at least part of the threat. 


2 Wildfire worms 


The omnipresence and constantly improving capabilities 
of wireless mobile devices have attracted the regrettable 
attention of attackers, and in particular virus writers. 
The “Cabir” virus, which first appeared in 2004, was 
the first instance of mobile malware [27]. The virus exploited 
vulnerabilities in the Symbian OS and propagated 
through Bluetooth wireless connections. Experts predict 
that the threat to smart phones and mobile devices is likely 
to increase significantly in the near future [40, 28]. 
Although such attacks may become prevalent in the 
years to come, in this paper we consider whether large- 
scale attacks are already feasible today on existing wire- 
less infrastructure using current technology. In particu- 
lar, we focus on worms that could spread entirely over 
Figure 1: Dense wifi amplifies botnet threats.


802.11 wireless networks, even if such networks are 
completely heterogeneous. In this environment, the main 
concern is not necessarily the infection of mobile devices 
such as PDAs and cell phones, but the existing large pop- 
ulation of laptops, desktops and other computers com- 
municating over wifi. We consider worms that propa- 
gate entirely over wireless connections, trying to infect 
other computers tuned to the same access point (AP) and 
also other APs within range. A notable fraction of hosts 
in such an environment may also be mobile, and could 
therefore carry the infection from one AP to another. 
In densely populated metropolitan areas, it is conceiv- 
able that such a worm could infect a large fraction of 
wireless-connected hosts, especially considering perva- 
sive vulnerabilities such as the ones exploited by Slam- 
mer [12], and recent browser vulnerabilities [13]. Such 
“client-side” vulnerabilities are of particular interest in 
a wifi setting, because unlike wired environments where 
a user needs to visit a malicious site to get exploited, it 
is often possible for an infected client to inject this kind 
of exploit via spoofing to any session between the target 
and a legitimate server. Considering the worst-case, a 
device driver exploit such as the recently discovered In- 
tel driver attack [24, 36, 42] could carry the worm across 
platforms, and would even bypass VPN software which 
often blocks all local, wireless connections. 

Although there has been considerable work in the literature
on how to deal with large-scale attacks on traditional
“wired” networks, wireless networks differ in at least three
ways that require alternative
solutions. First, wireless attacks can spread contagiously
over wireless links based on proximity, similarly to real-world
diseases, in contrast to the any-to-any communication
possible over the Internet. This renders previous
models and analyses of Internet-based worm propagation
ineffectual, as they cannot be directly mapped to wireless
networks. Second, traffic in wireless networks is
difficult to control using conventional methods, for lack
of “hard” enforcement points such as firewalls between
the communicating nodes. This is likely to significantly 
constrain the space for potential defenses. For instance, 


Figure 2: Simplified model of wildfire worm propagation. 


if such a wireless worm were to be unleashed today, it 
would most likely go undetected by most, if not all, cur- 
rent attack detection infrastructures [17, 2, 3]. Finally, 
devices (e.g. handheld devices in the near future) in 
these environments are likely to be significantly more 
resource-constrained, at least in contrast to traditional 
desktop settings, and it is therefore more difficult and ex- 
pensive to employ end-point security measures. 

This paper is not the first to examine the threat 
of worms in wireless networks. Other researchers 
have made attempts at deriving contagion models in 
MANETs, examining viruses that spread according to 
user mobility, or measuring propagation dynamics in a 
campus network (these studies are discussed further in 
Section 6). Our paper is the first to explore, in depth, the
problem of wildfire worms and proximity propagation 
in densely populated areas. Specifically, we discuss the 
threat of worms that propagate entirely over wifi connec- 
tions, and attempt to quantify the threat in terms of infec- 
tion prevalence and infection timescales. Providing reli- 
able estimates of potential infection prevalence is impor- 
tant for creating awareness of the severity of the threat,
while the likely infection times are needed to guide the 
design of suitable countermeasures. Our analysis relies 
on simulated outbreaks of wifi worms driven by real- 
world data derived from wifi maps of large metropolitan 
areas around the world. Among other observations, our 
results suggest that a carefully crafted wildfire worm can 
infect all vulnerable wifi-connected computers in 80% 
of access points in some studied areas within 10-20 min- 
utes — timescales at which traditional defenses may not 
be able to react in a timely fashion. 

In this section, we describe the design and attack vectors
of a wifi worm. The fundamental principle is that a
wildfire worm relies on local, proximity-based propagation
within a shared-medium broadcast environment such
as a WLAN.


2.1 Wifi worm propagation 


Figure 2 illustrates the propagation dynamics of wildfire 
worms. Three access points A, B and C provide wireless
coverage to end users, e.g. mobile nodes 1-7. They
could represent, for example, WLANs deployed at ad- 
jacent buildings. Note that overlapping usually exists 
between adjacent access points for both residential net- 
works (especially in densely populated cities) and corpo- 
rate wireless networks (to allow for continuous connec- 
tivity and seamless mobility roaming). 

Assume node 1 is the initial source of infection, i.e. it
was infected previously at some other location before as- 
sociating with access point A. Once activated, the worm 
analyzes WLAN A and probes all victims in the neigh- 
borhood; hence node 2 and node 3 eventually get in- 
fected. Note that node 3 is under coverage of both A and 
B. Normally node 3 picks and associates with only one 
access point, which is decided by certain criteria such as 
wifi signal-to-noise ratio. A worm-infected node, how- 
ever, can gather a list of usable access points within reach 
and scan them for victims in the proximity. Effectively, 
the worm toggles association between usable WLANs to 
spread itself. Eventually all nodes in WLAN B and C are 
compromised through node 3 and nodes 5/6 respectively. 
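The propagation just described can be sketched as a reachability computation over AP coverage sets. This is a minimal illustration, not the simulator used later in the paper. The coverage assignment below is an assumption chosen to agree with the description of Figure 2 (the text only fixes that node 3 is covered by both A and B, and that nodes 5/6 bridge into C), and the function names are illustrative.

```python
from collections import deque

# Assumed coverage sets, chosen to be consistent with the description of Figure 2.
COVERAGE = {
    "A": {1, 2, 3},
    "B": {3, 4, 5, 6},
    "C": {5, 6, 7},
}

def spread(coverage, source):
    """Return {node: hop count} for a worm that toggles between usable APs."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # An infected node probes every WLAN whose coverage it falls under.
        for members in coverage.values():
            if node not in members:
                continue
            for victim in members:
                if victim not in hops:
                    hops[victim] = hops[node] + 1
                    queue.append(victim)
    return hops

def bridges(coverage):
    """Return the nodes that sit under the coverage of more than one AP."""
    seen, multi = set(), set()
    for members in coverage.values():
        multi |= seen & members
        seen |= members
    return multi
```

Under these assumed coverage sets, starting from node 1 reaches node 7 only after three hops, which is the gradual, local behaviour that distinguishes this kind of worm from an Internet worm.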

Nodes at coverage intersection of access points are 
“bridges” that help propagate the worm. These nodes 
can be thought of as “connectors” in the small-world 
phenomenon hypothesis [44, 41]. In contrast to traditional
Internet worms, where node 1 could
probe and infect node 7 instantly, the propagation dynamics
of wildfire worms resemble the gradual, local
diffusion of a disease. Therefore, a major advantage
and difference of a wildfire worm over a regular Inter- 
net worm is that a wildfire worm can propagate entirely 
locally within each connectivity area, and thus evade fire- 
walls and intrusion detection/prevention systems located 
at traditional enforcement points on the boundary be- 
tween the local networks and the Internet. 

Fertile ground for wildfire worms includes wireless hotspot
networks, which provide Internet access in public areas
such as restaurants and airports, and the private wireless
networks of home users in residential areas. For example,
the Singapore government is realizing a “Digital Singapore”
with wireless hotspots available at every street corner 
where people can log onto the Internet and receive emails 
on the move. Section 2.6.2 evaluates whether wifi pen- 
etration in metropolitan areas is sufficient for sustaining 
the spread of a wifi worm. 


2.2 Mobility 


Presently, the wireless node population consists mostly 
of laptops, and to a lesser extent of PDAs and smart- 
phones (including wifi VoIP phones). The mobility pat- 
terns of wireless users can affect worm dynamics in three 
ways. First, mobility could compensate for sparse con- 
nectivity that may hinder wildfire-style propagation, as 
users carry the worm to networks previously unreach- 


able by the worm. This is not restricted to just the places 
where the user turns on the laptop, as laptops can also be
programmed to wake up periodically as the user moves
from one place to another. At the same time, user mobil-
ity also helps worm propagation into protected networks, 
whether they use WEP or more secure WPA/WPA2 pro- 
tection, as the user will voluntarily (and perhaps even au- 
tomatically) authenticate to those networks. Finally, the 
worm could create fake access points to lure and infect 
mobile users. 


2.3 Open vs. Protected Access Points 


There is a significant number of publicly available 
“open” access points; the rest are protected with Wired 
Equivalent Privacy (WEP) encryption or Wifi Protected 
Access (WPA). A worm can propagate over unprotected 
wireless networks in the way shown in Figure 2. More- 
over, as a result of design and implementation flaws, 
WEP encryption is insecure. There are a handful of WEP
attacks in the literature, e.g. weak IV attacks [30],
keystream re-use [15, 22] and, more recently, fragmentation
attacks [20]. These attacks are not just of theoretical
value; they have been implemented in many practical
and efficient WEP cracking tools freely available
on the Internet. Wepcrack [8] reports a performance
comparison of some of these tools. Among them, Aircrack
[1] is particularly powerful, with a high success rate and
a relatively low cracking time that can vary from 5
seconds to 1 minute. However, Aircrack needs to spend
considerable time sniffing and capturing sufficient wireless
packets before a cracking attempt. For example, after
analyzing wireless usage statistics at a university campus
[7], we determine that it may take 1-2 hours on average
to successfully crack WEP encryption. Instead of passively
sniffing packets, the worm could also employ active
attacks, e.g., discovering the encrypted version of a
plaintext packet [8]. As for WPA, while not inherently
weak, it is susceptible to brute-force attacks if used with
a weak password in the most common WPA/PSK
configuration. Given the apparent susceptibility of the
currently available protection mechanisms, it seems likely
that worms would carry cracking tools as an additional
payload.


2.4 Infection process 


In the design of a wildfire worm, we note that there are 
two possible ways to exploit vulnerabilities. The first 
approach, known as the “push method”, is to directly 
probe for an exploitable service and inject code into that
service on clients, just as traditional worms do (e.g. the DCOM
RPC vulnerability on port 135 used by the Blaster worm). With
the second approach, dubbed “pull method”, instead of 
relying on a service vulnerability, the attacker exploits 
vulnerabilities, such as browser vulnerabilities by per- 







forming a man-in-the-middle attack. For example, the 
infected node can listen on the wifi and wait for the vic- 
tim to make a DNS request, spoof the response pointing 
to itself (or some other, unused address), pretend it is the 
web-server and respond with pages that include exploits 
such as the WMF exploit [13] or other exploits for IE 
and Mozilla that attempt to execute malicious code. ARP 
spoofing and TCP injection attacks may be used as well. 
We note that the distinction between worm and virus is 
blurred in this case, as propagation may require some 
form of user interaction, yet the attack is piggybacked on 
communication to a third party, rather than between in- 
fected and targeted host. The broadcast nature of most 
wireless setups makes “pull” attacks attractive for wild- 
fire worms as they can be exploited at a scale that was 
never possible for Internet worms. 


2.5 Proof-of-concept implementation 


We have implemented a proof-of-concept wildfire worm 
for both Windows XP and Windows Vista. This worm, 
dubbed Wildfire/A, has been submitted to security ven- 
dors for testing. The implementation of this worm was 
surprisingly straightforward given the plethora of tools
publicly available. 

The WLAN API available for both Windows Vista
and XP facilitates the process of managing AP associa-
tion and scanning. Through this API, the worm is able to 
actively scan for open “visible” APs and, in turn, asso- 
ciate with them. Once associated with an AP, the worm 
scans the local subnet for vulnerable machines. For 
this particular proof-of-concept implementation we only 
considered push exploits, namely, the chunked-encoding 
vulnerability found in the Apache Web server 1.22. The 
worm payload is packaged as a self-extracting archive 
that contains the libraries required by the WLAN API as 
well as a copy of the actual worm. We have confirmed 
that the worm operates as expected in a small scale ex- 
periment with 4 APs and 15 vulnerable hosts. 


2.6 Analysis 


As with all worms, wildfire worms need to exploit a vul- 
nerability to infect end-hosts. Unlike Internet worms that 
can effectively spread even if the vulnerable population 
is very small [48], wifi worms depend on the vulnera- 
bility being widespread. This raises two questions: what 
critical mass does a wildfire worm require to be effective, 
and whether there are indeed such pervasive vulnerabili- 
ties. 


2.6.1 Vulnerabilities 


To determine whether there is a significant number of 
pervasive vulnerabilities, we analyze vulnerability data 
from a variety of sources, including NVD [4], Securi- 
tyfocus [6], and other independent sources. We focus 
on remotely exploitable vulnerabilities in the default in- 


stallation of Windows XP Service Pack 2, between Au- 
gust 2004 (the Windows XP SP2 release date) and Jan- 
uary 2007. We classify vulnerabilities based on whether 
they can be triggered through direct injection (“push” ex- 
ploits) or through spoofing attacks as discussed in the 
previous section (“pull” exploits). Starting from basic in- 
formation available through the NVD database, we ver- 
ify the vulnerability information and derive further de- 
tails such as exploit availability, exploitation technique, 
disclosure date, and patch dates primarily from Security- 
focus archives but also other independent sources. 

For all the qualifying vulnerabilities, we attempt to get 
a rough estimate of the vulnerability window: the amount 
of time the vulnerability was known and not patched in 
the majority of hosts. Unfortunately, publicly-available 
information does not always give us an accurate timeline 
of exploitation time vs. disclosure time, and we there- 
fore have to make certain assumptions. In particular, we 
optimistically assume that by the time a vendor (in this 
case, Microsoft) releases an update, all hosts in the net- 
work are instantly updated and patched. In most (but 
not all) cases, the vulnerability is disclosed by the ven- 
dor only when the update is available. As such, it is not 
always possible to determine exactly when the vulnera- 
bility became known and to consider this as the start of 
the vulnerability window. 

Lacking more accurate data, we assume that the vulnerability
window starts two weeks before the update is
issued, as Microsoft only posts updates on the second
Tuesday of each month. This is corroborated by Syman-
tec which reported an average period of 13 days for the 
first half of 2006 between disclosure date of a vulnera- 
bility and the release date of an associated patch by Mi- 
crosoft [53]. 
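The window assumption above can be made concrete with a short sketch that counts the days covered by at least one vulnerability window. The 14-day lead and the instant-patching assumption come from the text; the function name and the patch dates used in the usage note are hypothetical and serve only to show how overlapping windows are merged.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=14)  # assumed lead time before each update, as in the text
ONE_DAY = timedelta(days=1)

def exposed_days(patch_dates, start, end):
    """Count days in [start, end) covered by at least one vulnerability window.

    Each window runs from 14 days before the update up to (not including)
    the update date, since all hosts are assumed to be patched instantly.
    """
    days = set()
    for patched in patch_dates:
        day = max(patched - WINDOW, start)
        last = min(patched, end)
        while day < last:
            days.add(day)  # the set merges overlapping windows
            day += ONE_DAY
    return len(days)
```

For example, two hypothetical updates issued on 10 and 17 January 2006 produce overlapping windows that together cover 21 days rather than 28, so `exposed_days([date(2006, 1, 10), date(2006, 1, 17)], date(2005, 12, 1), date(2006, 2, 1))` returns 21.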

The results indicate significant exposure to vulnerabil- 
ities in the default configuration over the last two years, 
accounting for more than 50% of all days in the to- 
tal period. Vulnerabilities of the “push” type, i.e., those that
affect services and do not need user interaction, were
active for 105 days (11.89%), while those of the “pull” type, i.e., those that
need user interaction of some kind, were active for 428
days (48.47%). We believe this observation suggests a
trend, in which server/services components seem to be 
relatively robust when compared to client components. 
This is especially alarming in the context of wifi worms, 
because they are particularly suited for exploiting such 
vulnerabilities, and their abundance may give them an- 
other evolutionary advantage over Internet worms. Over- 
all, we have found that 60% of the listed vulnerabilities 
had public exploits available for 391 days (44.28%) dur- 
ing the time period. 
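As a consistency check on the figures in this paragraph, the reported day counts and percentages can be inverted to recover the length of the study period. The text does not state this length directly; the 883 days below is inferred from the reported numbers and is therefore an assumption, although it is in line with the stated August 2004 to January 2007 interval.

```python
# Reported (days, percent) pairs from the text.
REPORTED = {
    "push": (105, 11.89),
    "pull": (428, 48.47),
    "public exploit": (391, 44.28),
}

def implied_period(days, percent):
    """Length of the study period, in days, implied by one reported pair."""
    return round(days / (percent / 100.0))

def share(days, period):
    """Percentage of the period covered, rounded as in the text."""
    return round(100.0 * days / period, 2)

PERIODS = {name: implied_period(d, p) for name, (d, p) in REPORTED.items()}
```

All three pairs imply the same period of 883 days, and recomputing each share against that period reproduces the reported percentages.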

Other analyses of vulnerability exposure for the years 
2004-2006 published on the Internet paint an even dim- 
mer picture for “pull” type attacks. For a total of 284 
Figure 3: Spread of a wild-fire worm.


days (78%) in 2006, exploit code for known, unpatched 
critical flaws in pre-IE7 versions of the browser was pub- 
licly available on the Internet, and there were at least 
98 days in which no software fixes from Microsoft were 
available to fix IE flaws that criminals were actively us- 
ing to steal personal and financial data from users [39]. 
For at least 256 days (70%) in 2005, Internet Explorer 
contained unpatched vulnerabilities where the exploit 
method had been publicly disclosed but was not neces- 
sarily being used, and for at least 38 days in 2005, IE was 
vulnerable to unpatched critical security flaws that were 
being actively exploited [38]. A fully patched Internet 
Explorer installation was known to be unsafe for 98% of 
2004, and for 200 days (54%) there was a worm or virus 
in the wild exploiting one of those unpatched vulnerabil- 
ities [11]. For Firefox, there were 56 days (15%) in 2004 
where a publicly known remote-code execution had not 
yet been thwarted with a patch [11]. 


2.6.2 Worm simulation 


To understand wildfire worm propagation, we simulate 
the outbreak of a worm in nine well-known US cities 
and Singapore. For this we relied on publicly available 
maps of Access Point locations from the Wigle.net [10] 
“wardriving” database, as well as empirically derived 
data for the city of Singapore. From these maps, we only 
consider open APs where the worm can spread without 
having to crack the encryption or the password. 

Available war-driving maps chart APs but not con- 
nected hosts, so we had to populate them by randomly 
distributing hosts around APs. Based on our war-driving 
measurements and assuming a pervasive vulnerability, 
we distribute an average of 0.5 hosts per AP with Pois- 
son distribution at an exponentially distributed distance 
of 10 m on average. We model effective AP range as 
omnidirectional with a radius of 90 m. 
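A minimal sketch of this host-placement step follows. The Poisson mean of 0.5 hosts per AP, the exponentially distributed distance with a 10 m mean, and the 90 m range are taken from the text. The uniformly random direction, the planar coordinates in metres, and all function names are assumptions made for illustration.

```python
import math
import random

def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's method (adequate for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def place_hosts(aps, rng, hosts_per_ap=0.5, mean_dist=10.0):
    """Scatter hosts around each AP given as an (x, y) position in metres."""
    hosts = []
    for ax, ay in aps:
        for _ in range(poisson(rng, hosts_per_ap)):
            r = rng.expovariate(1.0 / mean_dist)      # distance from the AP
            theta = rng.uniform(0.0, 2.0 * math.pi)   # assumed uniform direction
            hosts.append((ax + r * math.cos(theta), ay + r * math.sin(theta)))
    return hosts

def in_range(host, ap, radius=90.0):
    """True if the host lies within the omnidirectional AP range."""
    return math.hypot(host[0] - ap[0], host[1] - ap[1]) <= radius
```

Passing a seeded `random.Random` instance as `rng` keeps a run reproducible, which matters when prevalence curves are averaged over repeated runs.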


Finally, we do not consider the possibility of bypass- 
ing the AP to directly infect hosts within range using low 
level techniques because these depend on the available 
device driver and may not be widely available. We also 
ignore host mobility except that we assume the epidemic 
starts from 50 random locations to avoid artificially con- 
fining the worm to a sparse disconnected portion of the 
city. 

The infection time for one hop is determined by four 
factors: scanning time, association time, IP acquisition 
time and transmission time. Based on our wildfire worm 
prototype, we assume a scanning and association time 
of about 1.5 seconds. We do not model DHCP interac- 
tion in our simulations as the worm can simply hijack an 
IP address. With an effective throughput of 14 Mbps
and 8 Mbps for typical 802.11g and 802.11b networks,
respectively, the transmission speed is between 1 Mbyte
and 1.7 Mbytes per second. Since the bandwidth will be
shared among hosts, each host gets a transmission speed
of a few hundred kbytes/sec. We assume a transmission
speed of 100 kbytes/sec per host. For a worm
size of 100 kbytes, which should be sufficient, the transmission
time is about 1 second. A simulated worm-infected
node infects its neighbours sequentially using these pa- 
rameters. 
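The timing parameters above combine into a per-hop infection time as in the following worked sketch. The numeric values are those given in the text. Charging the full scanning and association cost for every neighbour is an assumption about how the sequential infection is timed, and the function names are illustrative.

```python
def mbps_to_mbytes_per_s(mbps):
    """Convert a throughput in megabits per second to megabytes per second."""
    return mbps / 8.0

def hop_time(scan_assoc_s=1.5, worm_kbytes=100.0, rate_kbytes_s=100.0):
    """Seconds needed to infect one neighbour: scan/associate, then transmit."""
    return scan_assoc_s + worm_kbytes / rate_kbytes_s

def time_to_infect(n_neighbours, **params):
    """Seconds needed to infect n neighbours one after another (assumed model)."""
    return n_neighbours * hop_time(**params)
```

With the default values, one hop takes 1.5 + 100/100 = 2.5 seconds, so a node with four vulnerable neighbours needs about 10 seconds to infect all of them.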

Each simulation consists of 20 runs; for each run we 
start the infection from 50 different randomly selected 
hosts. We collect the mean values across runs of infec- 
tion prevalence over time. 
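The run-and-average procedure can be sketched with a small event-driven model. This is not the simulator used for the results below. It assumes a precomputed map from each host to the hosts it can reach, a fixed per-hop time of 2.5 seconds, and illustrative function names.

```python
import heapq

def simulate(neighbours, seeds, hop_time=2.5):
    """Return {host: infection time in seconds} for one run."""
    infected = {}
    events = [(0.0, seed) for seed in seeds]
    heapq.heapify(events)
    while events:
        t, host = heapq.heappop(events)
        if host in infected:
            continue  # already infected earlier by another host
        infected[host] = t
        # A newly infected host attacks its neighbours one after another.
        targets = [n for n in neighbours.get(host, ()) if n not in infected]
        for i, victim in enumerate(targets, start=1):
            heapq.heappush(events, (t + i * hop_time, victim))
    return infected

def prevalence(infected, total, t):
    """Fraction of all hosts infected by time t in one run."""
    return sum(1 for when in infected.values() if when <= t) / total

def mean_prevalence(runs, total, t):
    """Mean prevalence at time t across several runs."""
    return sum(prevalence(run, total, t) for run in runs) / len(runs)
```

Evaluating `mean_prevalence` on a grid of times, with each run started from a different random set of seed hosts, yields a prevalence-over-time curve of the kind plotted in Figure 3.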

Figure 3 is a plot of infection prevalence over time 
for a “push” worm. Dense cities are infected very fast: 
80% of New York and Chicago in less than 20 min- 
utes. San Francisco and Philadelphia are infected fast 
as well: about 50% of San Francisco and Philadelphia 
are infected in 45 and 11 minutes respectively. A wild- 
fire worm does not spread significantly in Los Angeles 
and Las Vegas, but on a longer time scale a worm could 
still spread with the help of user mobility. The worm can 
spread fast as long as there are enough APs to maintain 
connectivity, but high density may even bog down the 
worm in some cases. In absolute numbers, we see that an 
attacker could quickly gain access to tens of thousands
of hosts in most cities. The attacker could start simul- 
taneous, independent epidemics in many cities using the 
Internet to infect a few seed hosts. 

As for pull worms, we briefly summarize the simula- 
tion results here without a figure. Their simulated spread 
is limited compared to push worms — prevalence of pull 
attacks is limited to 60% in 3 hours for New York and 
Chicago, but they are potentially more dangerous, as 
they can take advantage of more vulnerabilities. They 
are slower because the infection time must include wait- 
ing for the victim to offer an opportunity for infection in 
the form of a DNS request or TCP connection. On the
other hand, the worm can wait in parallel for any vic-
tim to become active. We use a very rough estimate of 
10 minutes for waiting time to get an idea of the time 
scales involved, acknowledging that some machines may 
have no browsing activity at the time. The pull worm 
also requires higher density since we assume a shorter 
range of 60 m. Weaker antennas and increased interfer- 
ence typically weaken client transmission characteristics 
when compared with APs. 

Overall, these time-scales suggest that automated de- 
fenses are crucial for defending against wildfire worms. 


3 Large-scale Wifi Spoofing 


One key property of open 802.11 networks is that they 
are built around a broadcast medium, where any wire- 
less station can transmit wireless frames, and can listen 
to all other frames transmitted on the network. This is 
reminiscent of shared Ethernet segments of the 90’s. 

This property makes wireless LANs susceptible to
spoofing and injection attacks, which were discussed extensively
in the context of wired Ethernet (but effectively disappeared
with the emergence of switched Ethernet). The
basic idea is that an attacker can monitor the communica- 
tion between hosts on the wireless network, or between a 
host on the wireless network and an external party. If the 
communication is not properly encrypted, the attacker 
can elicit session state through eavesdropping, and if the 
communication is not authenticated, he can then inject 
frames to one session endpoint pretending to come from 
the other session endpoint. 

Most protocols, such as DNS, DHCP and TCP, are sus-
ceptible to this attack. In the case of DNS, the attacker 
can watch for outgoing DNS queries and inject responses 
pointing to a host under his control. For TCP the attack 
is similar — all the attacker needs to know is the current 
state of the connection in terms of sequence numbers. At 
connection setup, he may even completely take over the 
connection by injecting the proper SYN-ACK, resulting 
in the legitimate endpoint being out of sync. Injection 
is also possible at any point in the connection as long 
as the attacker can time injection attempts to properly 
deliver TCP segments to the victim network stack. The 
DHCP protocol can be spoofed to have a victim use an 
IP address and default gateway that gives the attacker full 
control over all of his traffic. However, it may be less at- 
tractive than DNS and TCP spoofing as the attacker has 
to wait for the victim to refresh his DHCP lease, or else 
attack only hosts that have connected after the attacker 
has obtained access to the wifi network. 
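The role of sequence numbers in the TCP case can be illustrated with the receiver-side window check. The sketch below is a simplified form of the segment acceptability test in the TCP specification that examines only the first sequence number of a segment. The function and parameter names are illustrative, and real stacks apply further checks.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap around modulo 2^32

def in_receive_window(seg_seq, rcv_nxt, rcv_wnd):
    """True if seg_seq falls in [rcv_nxt, rcv_nxt + rcv_wnd), modulo 2^32."""
    return (seg_seq - rcv_nxt) % MOD < rcv_wnd
```

Because nothing beyond this kind of state check guards an unauthenticated connection, any station that can observe the connection state on the shared medium can construct a segment that passes it.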

While in the 90’s such attacks were seen as enablers 
for unauthorized access, in today’s threat landscape they 
are more likely to be used for “modern” attacks such as 
phishing, spam and exploit injection. In the previous sec- 
tion we briefly discussed how injection can be used to 


propagate a worm through client-side vulnerabilities. In 
this section we focus on spoofing primarily for the case 
of launching phishing attacks, and discuss ways to detect 
and prevent them. DNS spoofing is highly attractive for 
phishing as, for example, the attacker may set up a mock 
banking website that would relay manipulated requests 
to the real site in a man-in-the-middle fashion. We note 
that in this case, two-factor authentication cannot help. 
Similarly, TCP injection can be used to insert redirection 
instructions, advertisements, or spam to otherwise legiti- 
mate Web pages. Sophisticated attacks can even subvert
a user’s services, for example by using a victim’s Gmail account.

The use of such techniques in wifi for phishing has 
been documented previously. The so-called “parking lot 
attack” involves the attacker being in physical proximity 
to the target network. While this attack may be interest- 
ing by itself, we are not aware of any extensive use of 
this technique. One main disadvantage is that the physi- 
cal proximity constraint increases the risk to the attacker, 
especially in environments with pervasive CCTV cover- 
age that can be used for forensics. In the context of this 
paper we explore how proximity enables remotely con- 
trolled bots to be used for such activities. In this case, 
the attacker can acquire access to a wifi-enabled host lo- 
cated in a wifi-rich location. In contrast to traditional 
Trojans, the attacker need not try to elicit information 
from the owner of the actual machine that is being ex- 
ploited. Rather, the attacker may perform spoofing on 
any wireless network within range from the host under 
his control using channel hopping and/or temporary as- 
sociation for the duration of the attack. The dense use of 
wifi in metropolitan areas makes this model quite attrac- 
tive, as it may significantly amplify the attacker’s capa- 
bilities. 


3.1 Analysis 


To determine the effectiveness of spoofing attacks in 
terms of scale we rely on the same publicly-available wifi 
maps used for analyzing wildfire worms. We attempt to 
get a rough estimate of the number of access points that 
hosts on the map can connect to. As we only have access 
point locations, we add hypothetical hosts within range 
from each access point. We distribute 1 host per AP and 
assume a communication range of 60 m.

We compute the number of neighboring APs for each 
host, that is, all APs within range excluding the AP it is 
directly connected to. We consider only “open” APs that 
do not use any wireless security protocol, even though 
the attacker may well be able to crack into WEP-enabled 
networks using well-known attacks and tools. 
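The neighbour count described above reduces to a distance test, sketched below. The 60 m range and the exclusion of the host's own AP follow the text. The planar coordinates in metres and the function names are assumptions for illustration.

```python
import math

def neighbouring_aps(host, aps, own_ap, radius=60.0):
    """Count the APs within range of a host, excluding the AP at index own_ap."""
    hx, hy = host
    return sum(
        1
        for i, (ax, ay) in enumerate(aps)
        if i != own_ap and math.hypot(ax - hx, ay - hy) <= radius
    )

def fraction_with_at_least(counts, k):
    """Fraction of hosts that can reach at least k neighbouring APs."""
    return sum(1 for c in counts if c >= k) / len(counts)
```

Applying `fraction_with_at_least` to the per-host counts for increasing values of k produces a cumulative view of the kind summarized in Figure 4.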

The results for our analysis on 10 different metro areas 
are shown in Figure 4. We see that in half of the cities 90- 
99% of all hosts can connect to at least one more neigh- 
boring AP; 20-50% of hosts can connect to at least 10 
Figure 4: Number of WLAN networks observable from
random hosts in metro areas (range 60 m). 


additional APs; and a small but non-negligible number 
of hosts, as high as 10% in Chicago, are within range of
more than 100 APs. Unsurprisingly, the results are worse 
for Chicago, which seems very densely populated, and 
less so for relatively sparse areas. 

Overall, the results confirm our fear that controlling 
wifi-enabled hosts in densely populated areas can be 
highly attractive to attackers. 


4 Wifi tracknets 


The proliferation of city-wide wifi networks has already 
raised serious concern over privacy implications. Privacy 
advocates fear that wifi networks can be used to record 
location information for the operating ISPs, their part- 
ners, and possibly law enforcement, raising concerns that 
wifi can be used to track general user behavior in a “Big
Brother” fashion.

However worrying this scenario might appear, it can 
be classified as a mere nuisance when compared with 
the possibility of anyone being able to remotely set up 
a tracking system, without even having to set up physical 
infrastructure. Such systems, which could be termed
Tracknets, can be deployed using a reasonably sized botnet,
providing a user-tracking mechanism that can operate
across wireless network boundaries. Criminal gangs
are known to operate marketplaces for bots, sometimes
with specific features such as high bandwidth and CPU
power, priced between $1 and $40 per compromised PC
according to security experts who have monitored IRC
chat room exchanges [54]. It is conceivable that attributes
such as wifi connectivity, and location within a metro- 
area could be added to the list of features to facilitate 
attacks such as those described here. 

Such a botnet can then track location information [16], 
possibly coupled with user-profiles that can span across 
heterogeneous wireless LANs. The location of the zombies
comprising the botnet can be inferred from the ESSID
Figure 5: Spoofing defense space.


of their AP using public wifi maps. (In fact, this service 
is already provided by companies such as Navizon and 
Skyhook.) The number of users that can be tracked us- 
ing Tracknets and its coverage are commensurate with 
the size of the botnet population and the amplifying ef- 
fect of proximity, similar to the spoofing threat discussed 
in the previous section. 

Several services can leak significant amounts of 
privacy-sensitive information. This information can, in 
turn, be used for targeted phishing and spam attacks,
blackmail, and for pre-attack reconnaissance such as 
building hit-lists. In addition to high-information-leak 
vectors, several techniques can provide personal infor- 
mation at a lower granularity that might not be able to 
distinctly identify individual users but can be used to 
classify sets of users according to a broader set of criteria
such as OS version, wireless driver information
and general browsing behaviour. In this section we
briefly examine some of the most obvious tracking vec- 
tors. Our investigation is far from exhaustive and only 
scratches the surface of possible ways that users could be 
tagged and tracked. Nevertheless, the vectors we discuss 
show at least one set of techniques that seem threaten- 
ing enough by themselves, and may be representative of 
other approaches. 


MAC address. The obvious way to track users across
heterogeneous WLANs is to use the MAC address as 
unique identifier. Trackers can use this information to 
correlate any other behavioral information to a MAC 
address to easily create profiles. Fortunately, although 
MAC addresses are permanent by design, there exist a 
number of mechanisms that allow users to change the 
identifier. Gruteser et al. [32] introduce the idea of short-
lived disposable MAC addresses as a technique for the 
reduction of the effectiveness of location tracking. How- 
ever, randomizing MAC addresses often leads to prob- 
lems. For example, several ISPs use MAC addresses 
to map IP addresses. Also, some software licenses are
bound to a specific MAC address. Furthermore, even in 
the presence of such techniques, user profiling can still 
effectively track users in dense urban environments. In 
our system, we use MAC addresses as temporary identi- 
fiers for correlating information that will be used to cre- 
ate user profiles as described below. 


Live bookmarks (RSS). Live bookmarking is a new,
popular method for displaying web feeds as bookmarks.
Its popularity surged when it was introduced in Mozilla
Firefox 1.0 in 2004, and it can now be found in several
other popular web browsers such as Apple’s Safari and
Internet Explorer 7. Live bookmarks subscribe to
user-defined RSS feeds and are periodically updated so as to
display the latest articles. The ability to customize feeds 
along with the inherent periodicity of the updates make 
Live Bookmarks susceptible to eavesdropper profiling. 
In particular, as users subscribe to more RSS feeds they 
inadvertently create distinct profiles that can be used to 
track them. Given the wide range of tools available for 
parsing RSS feeds, it is trivial for a tracker to parse the 
feeds so as to extract user personalization in addition to 
RSS subscription information. Worse, by using traffic 
analysis to identify such communications based on their 
periodicity and creating a signature based on packet size 
distributions, an attacker could possibly track users over 
encrypted WLANs; however, we have not investigated
this scenario further. 

Tracknet bots would collect and parse all requests to 
RSS feeds. The information derived from the feed is then 
associated to an individual node. The node is temporarily 
identified by IP and MAC address for the current session. 
Any other information that is collected from the partic- 
ular node is collected in a tracking tuple that correlates 
all other pertinent fields that aid in the identification of 
the node. In order to reduce the number of identification 
false positives we correlate the RSS fingerprint with the 
base station ESSID. Distinct fingerprints that appear at 
the same location (e.g. home or workplace) might point 
to a distinct identity with a higher level of confidence.
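
As a rough illustration of the tracking tuple described above, the following sketch pairs an order-independent digest of a node's RSS subscriptions with the base station ESSID, so that sessions can be linked even when the MAC and IP address change. The class, field and function names are hypothetical and chosen for illustration only; the text does not specify how the tuple is represented.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class TrackingTuple:
    """Per-session record for one observed node (field names are illustrative)."""
    mac: str      # temporary identifier for the current session
    ip: str       # temporary identifier for the current session
    essid: str    # base station ESSID where the node was observed
    feeds: set = field(default_factory=set)  # RSS feed URLs seen in requests

    def rss_fingerprint(self) -> str:
        # Order-independent digest of the subscribed feed URLs.
        joined = "\n".join(sorted(self.feeds))
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def identity_key(self) -> tuple:
        # Correlating the fingerprint with the ESSID is meant to reduce
        # identification false positives.
        return (self.rss_fingerprint(), self.essid)


def link_sessions(sessions):
    """Group sessions that share an RSS fingerprint and ESSID, ignoring MAC/IP."""
    groups = {}
    for s in sessions:
        if s.feeds:  # an empty feed set carries no identifying information
            groups.setdefault(s.identity_key(), []).append(s)
    return groups
```

Two sessions observed with different MAC addresses but the same feed set and ESSID fall into the same group, which is the situation in which disposable MAC addresses alone would not prevent tracking.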


Location tracking. Collaborating bots can use radio
signal characteristics of WLANs to determine a user’s
location with relative accuracy using triangulation
techniques. This information, in combination with other
extracted personal information, can lead to considerable
privacy leaks. Specifically, bots can use this information
to infer user behavior. For example, entertainment
habits, political orientation and medical information
can potentially be derived.


Other services. Beyond the mechanisms described
above, there are numerous other protocols and services
that leak significant personal information. For example,
numerous Instant Messaging (IM) systems do not employ
encryption, so all user identification information is
available to eavesdroppers. Although this information might
not be significant on its own, when it is correlated with 
other sensitive information, it can be used to construct a 
distinct user profile. Other systems that can be used to 
fingerprint user behavior are the mail servers that users 
connect to, information from other networking protocols 
such as NETBIOS and AppleTalk and even which VPN 
servers a user connects to. 

The growing popularity of Google and other online
service portals has moved a number of user services
to central aggregated locations where users can check
their RSS feeds and email. Although this configuration
changes the network fingerprint that is emitted by
services, it does not reduce the amount of information
that is leaked. For example, the Google homepage
includes links to personalized RSS feeds including the
user’s email address in plain text, which often points
to a user’s real identity, e.g., john.doe@gmail.com.
This information can be readily used to create very
accurate user profiles since a tracker can intercept these
unencrypted HTTP transfers.

Another serious vector of information leak is (to no 
surprise) the use of cookies. Cookies are used exten- 
sively as a mechanism for servers to identify users and 
track their access. The threat of cookies to user privacy
has received considerable attention in the literature [23].
In the context of tracknets, the exchange of cookie
information can be used to extract personalized user
information based on both the contents of the cookies and
their transmission fingerprint. For example, Google, a
company synonymous with Internet search, uses cookies
that expire in 2036. The cookie uses a 16-digit identifier
to track user preferences and, inevitably, track user be- 
havior. Given the popularity of the search engine, it is 
not unreasonable to assume that a large percentage of the 
user population will emit this identifier during its life- 
time, adding another mechanism for user tracking. 

The Dynamic Host Configuration Protocol (DHCP) is 
a ubiquitous protocol used for automating network
configuration. Unfortunately, there is no privacy protection
for DHCP messages, so an eavesdropper who can monitor
the link between the DHCP server and the requesting
client can discover the information contained in the
DHCP options. For example, the following snippet
illustrates the kind of information that can be derived
from a DHCP request. Information on the types of
services and, more importantly, hostname information is
made readily available to eavesdroppers.


    Client IP: 10.50.16.205
    Client Ethernet Address: 00:17:f2:40:61:65
    Vendor-rfc1048:
        DHCP: REQUEST
        PR: SM+DG+NS+DN+NI+NITAG+SLP-DA+SLP-SCOPE+LDAP+T252
        MSZ: 1500
        CID: [ether] 00:17:f2:40:61:65
        LT: 7776000
        HN: "alamak"


We collect and correlate the information derived from 
DHCP headers. In particular, we are interested in user- 
identifying information such as the user’s hostname. 
This information might appear innocuous but is often 
linked to personal information such as the user’s name 
or company information. Again, in this case we asso- 
ciate DHCP-derived information with the base station’s 
ESSID. 


4.1 Experimental analysis 


We determine how effective an attacker can be in track- 
ing users using a botnet consisting of wifi-enabled hosts 
within a metropolitan area. For this purpose, we rely 
on the same wifi maps used for analyzing the worm and 
spoofing attacks. The effectiveness of a tracknet can be 
expressed in terms of coverage, that is, the fraction of 
wireless LANs that are within range from a given set of 
subverted nodes participating in the tracknet. The feasi- 
bility of a tracknet also relates to the number of subverted 
nodes that the attacker needs to obtain in order to achieve 
a certain level of coverage. As the attacker may have lit- 
tle control over which hosts to subvert (or buy access to) 
and where they are located, in each experiment we as- 
sume a random subset of hosts on the wifi map. As MAC 
addresses are exposed even when the network uses WEP 
or 802.11i encryption, we consider all access points regardless
of whether they are open or protected; in other
words, a tracker can monitor any network within range. 
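
The coverage metric defined above can be sketched as follows. This is a minimal illustration, assuming that AP positions are given as planar (x, y) coordinates in meters, that bots sit at a random subset of AP locations, and that a fixed 60 m radio range applies (the range stated for Figure 6). The function names and the seeded random generator are assumptions made for the example, not details of our experimental setup.

```python
import math
import random


def covered_fraction(ap_positions, bot_positions, radio_range=60.0):
    """Fraction of APs with at least one bot within radio_range (meters)."""
    if not ap_positions:
        return 0.0
    covered = 0
    for ax, ay in ap_positions:
        if any(math.hypot(ax - bx, ay - by) <= radio_range
               for bx, by in bot_positions):
            covered += 1
    return covered / len(ap_positions)


def random_coverage(ap_positions, fraction_subverted, radio_range=60.0, seed=0):
    """Place bots at a random subset of AP locations and measure coverage."""
    rng = random.Random(seed)  # seeded so that a run is repeatable
    k = round(fraction_subverted * len(ap_positions))
    bots = rng.sample(ap_positions, k)
    return covered_fraction(ap_positions, bots, radio_range)
```

The pairwise distance check is quadratic in the number of APs and bots; for a city-scale map a spatial index would be the natural replacement.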

The results for 10 metro areas are shown in Figure 6. 
We observe that the fraction of subverted hosts needed 
to track users is relatively modest: with hosts on just 1% 
of all APs in a dense area, a tracknet can cover between 
5% and 40% of all traffic. As expected, full coverage is 
not easy to achieve, but having trackers on around 7% of APs
can reach between 30% and 80% coverage. As with the
worm and spoofing threats, the high density of Chicago
and NYC makes them particularly susceptible to this
attack: fewer than 1,000 zombies are sufficient to cover 40%
of the APs.

At the time of writing this paper, all MAC addresses 
are exposed, but it is worth investigating whether using 
disposable MAC addresses would help address this prob- 
lem. As discussed previously, we are particularly con- 
cerned about other high-information-leak profiling tech- 
niques that could essentially offer uniquely identifying 
information equivalent to a MAC address. We focus on 
RSS feeds as one emerging source of leaks, and try to 
quantify the ability of an attacker to use this information 
for tracking purposes. For this purpose, we have obtained 
from an online service provider the set of RSS feeds that 
users are subscribed to, for around 100,000 users. The
size of the dataset is important as we seek to measure the
uniqueness of each RSS profile. We therefore determine,
for each user, whether any other user has exactly the
same profile, in which case we say that we have a profile
collision (which could make tracking information
ambiguous and confusing to the attacker). As some users
have empty or very small profiles, we expect more colli- 
sions there, and we therefore compute collision statistics 
for those users with at least a minimum number of feeds 
in their RSS set. 
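
The collision statistic can be sketched in a few lines. This example assumes that a profile is the set of feed identifiers a user subscribes to, and that collisions are counted only among users who meet the minimum profile size; the latter is an assumption of the sketch, since the text does not state which population is used as the denominator.

```python
from collections import Counter


def collision_fraction(profiles, min_feeds=0):
    """Fraction of eligible users whose exact feed set is shared by another user.

    profiles  -- iterable of per-user collections of feed identifiers
    min_feeds -- only users with at least this many feeds are considered
    """
    eligible = [frozenset(p) for p in profiles]
    eligible = [p for p in eligible if len(p) >= min_feeds]
    if not eligible:
        return 0.0
    counts = Counter(eligible)
    colliding = sum(1 for p in eligible if counts[p] > 1)
    return colliding / len(eligible)
```

With `min_feeds=0`, users with an empty profile all collide with one another, which is why they dominate the unconstrained statistic.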


The results are presented in Figure 7. As expected for 
a minimum RSS set of zero, that is, no constraints, the 
fraction of users with colliding profiles is around 30% — 
most of them are users with an empty profile. Remov- 
ing only those that have an empty profile, that is, focus- 
ing on a minimum set of one entry, the collision proba- 
bility is 0.02 to 0.07, significantly lower and reasonable 
enough to allow a tracknet to identify a user with high 
confidence, especially given that this information can be 
correlated with other data. For users with more substan- 
tial RSS feeds, the collision probability is between 0.002 
and 0.01, indicating highly unique profiles. The scaling 
behavior of collision statistics is of particular importance 
here: we see that collision probability increases with the 
number of RSS profiles in the dataset, yet the difference 
seems to be small between a database of 50K users and a 
database of 100K users. If a tracknet is supposed to cover 
a whole city, the number of profiles can be much larger 
than the set we considered here, but our results suggest 
that collision probability is unlikely to worsen signifi- 
cantly. Furthermore, when a user’s RSS fingerprint is 
coupled with location information such as mobility pat- 
terns, this set can be reduced even further. 


5 Defense strategy 


The threat of wildfire worms and large-scale spoofing 
can be reduced significantly with the use of existing wire- 
less security standards such as WPA/WPA2, with strong 
encryption and hard-to-guess passwords. Unfortunately, 
despite the wide availability of such techniques, users do 
not seem to employ them. Even if this is simply because 
there have been no large-scale attacks yet, the use of 
passwords hinders usability and robustness. It is likely 
that even if such measures are implemented, in many 
cases the passwords are not going to be strong enough 
to resist brute force attacks. As such, it seems worth- 
while investigating alternative, reactive defenses specific 
to the attack vectors discussed so far. In the remainder of 
this section we discuss such defenses, as implemented in 
a prototype system for automated defense against wild- 
fire worms and spoofing attacks based on the Linksys 
OpenWRT [5] router and optionally using an external 
controller and centralized threat analysis. 







[Figure 6 plot. x-axis: fraction of subverted hosts (0 to 0.1);
y-axis: fraction of APs within range. One curve per metro area:
Bay, Chicago, Dallas, Los Angeles, Las Vegas, New York,
Philadelphia, Seattle, San Francisco, Singapore.]

Figure 6: AP coverage of a given fraction of random bots
in a metro area assuming a range of 60 m.


5.1 Wireless IPS 


In our implementation, we have adapted the snort 
IPS [46] to run on OpenWRT. While previous implemen- 
tations have used snort to filter traffic between the wire- 
less network and the external ethernet connection, our 
implementation disables the normal low-level
wireless-to-wireless forwarding and uses ebtables and iptables
to redirect traffic through userland where it can be
processed by snort.

As APs typically have limited computing resources, 
it may not be possible to have a fully fledged IPS run- 
ning on them. Increasing their capabilities may also be 
prohibitively expensive. There are at least two possible 
options to address this problem. The first option is to use 
only a subset of the signatures, most likely signatures 
for attacks against vulnerabilities that may not be uni- 
versally patched yet. The second option is to implement 
the IPS functionality in a centralized wireless controller, 
and have the APs forward all local traffic for inspection 
before retransmitting to the wireless medium. 

The main advantage of using a wireless controller is 
that it provides flexibility for devoting more resources 
to traffic inspection. It is also consistent with industry 
trends towards cheap, “dumb” access points managed 
by a wireless switch. However, none of the wireless 
switches we are aware of provide any filtering capabil- 
ities for internal WLAN traffic such as wildfire worms. 
In our case, the additional wireless-to-wireless IPS func- 
tionality is implemented as a standalone wireless con- 
troller. This functionality can be retrofitted into wireless 
controllers or implemented as part of a secondary con- 
troller — protocols for AP to controller communication 
are being standardized, and thus interoperability is likely 
to be achievable. 


Figure 7: RSS set uniqueness.


For zero-day attacks for which there are no signatures,
we rely on honeypot feeds from access points back to an
analysis center. There, we use the Argos system [45], 
which uses dynamic taint analysis to trap the execution 
of remotely-injected code, for detection coupled with our 
custom signature generation system. Our system col- 
lects packet trace samples corresponding to the exploita- 
tion attempts detected by Argos and then uses a heuris- 
tic for generating network-level attack signatures in the 
form of simple patterns. The heuristic tries to identify 
a substring that is sufficiently large, sufficiently frequent 
in the attack samples and sufficiently infrequent in be- 
nign traffic. This last part is important for addressing 
concerns about false positives as well as attempts to ma- 
nipulate signature generation for denial-of-service pur- 
poses. Our implementation uses a novel inverse indexing 
scheme on previously collected packet traces. While in 
our test setup these traces are maintained centrally at the 
threat center, it is conceivable that such testing can be 
performed at each site independently. These signatures 
are then installed on the AP or the wireless controller as 
snort sensor rules. 
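
The selection criterion of the heuristic (a substring that is long enough, frequent in attack samples and infrequent in benign traffic) can be illustrated with the brute-force sketch below. It is not our inverse indexing scheme: it simply scans fixed-length substrings, and the function name, thresholds and default values are hypothetical.

```python
def candidate_signatures(attack_samples, benign_samples, min_len=8,
                         min_attack_frac=0.9, max_benign_frac=0.0):
    """Return substrings of length min_len that satisfy the three criteria.

    attack_samples, benign_samples -- lists of bytes objects (packet payloads)
    """
    if not attack_samples:
        return []
    # Count, for every fixed-length substring, how many attack samples contain it.
    seen_in = {}
    for sample in attack_samples:
        grams = {sample[i:i + min_len]
                 for i in range(len(sample) - min_len + 1)}
        for g in grams:
            seen_in[g] = seen_in.get(g, 0) + 1
    result = []
    for g, n in seen_in.items():
        if n / len(attack_samples) < min_attack_frac:
            continue  # not frequent enough among attack samples
        hits = sum(1 for b in benign_samples if g in b)
        benign_frac = hits / len(benign_samples) if benign_samples else 0.0
        if benign_frac <= max_benign_frac:
            result.append(g)  # rare enough in benign traffic
    return sorted(result)
```

Rejecting candidates that occur in benign traces is the step that addresses both false positives and attempts to steer signature generation towards legitimate traffic.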

Filtering of internal WLAN traffic assumes that the 
worm does not tamper with the wireless device driver 
and firmware. If such tampering is possible, the at- 
tacker may spoof access point transmissions directly — 
for which tools are publicly available [9], and bypass 
filtering mechanisms applied to traffic relayed over the 
AP. The AP can detect attempts to impersonate it as long 
as it can pick up the messages sent by the attacker, but 
this leads immediately to another attack scenario: the at- 
tacker can hide his emissions from the AP by tuning the 
wifi radio power or using directional antennas so that the 
spoofed packets can reach the victim, but cannot reach 
the AP (or external detection device). We refer to this 
as the whisper attack. The attack seems difficult to en- 
gineer, as it requires both the low-level driver/firmware 
hacks of the basic 802.11 spoofing attack, as well as
careful tuning of the radio. Unfortunately, newer chipsets
provide improvements in power control, and it is likely 
that the attacker can easily find the “right” power setting 
to launch the attack by probing both the victim and the 
AP with different power settings, all controlled through 
the driver API. In some cases, the relative positions of 
AP, victim, and attacker may prevent this attack. In ad- 
dition, not using an AP to relay frames limits the com- 
munication range. Using power control to evade the AP 
may limit the range even further, to the point where it 
may become impractical to perform whisper attacks in 
the context of the massive attacks we discussed. 


5.2 Spoofing defense strategy and attack-defense co-evolution


Given that WPA and VPN solutions come with a
considerable usability cost, we investigate lightweight
alternatives. Interestingly, our exploration of defenses against
spoofing attacks has revealed a small arms race. In par- 
ticular, while developing defense techniques we discov- 
ered several new variations of the attack, each defeat- 
ing one of our countermeasures. In this section, we dis- 
cuss the attacks, the countermeasures and present results 
evaluating their effectiveness. These findings are sum- 
marized in Figure 5. We focus on DNS spoofing for sim- 
plicity, but in most cases the attacks and countermeasures 
are similar for other protocols. 


5.2.1 Wireless ingress filtering defense


As discussed previously, the simplest form of DNS
spoofing involves the attacker lurking for DNS requests
to the target site, and then injecting a fake DNS response
pointing to a site under the attacker’s control. It seems
straightforward to defend against this attack through the
use of ingress filtering at the AP. Ingress filtering ensures
that all traffic broadcast by the AP on the wireless
network is checked in terms of IP address and the
interface on which it is received. That is, traffic originating
from the wireless network should have IP addresses on
the local wireless network. (Similarly, but less relevant
here, traffic from the external network should not have
an IP address on the internal network.) A DNS request is
usually sent to a resolver outside the wireless LAN, and
therefore the DNS response is expected from an external
address. A spoofed response is trivial to detect, as it
arrives on the AP from the wireless interface and has an
external IP address.
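
The ingress filtering rule can be stated as a small predicate. The sketch below is illustrative only: the interface labels and the function name are hypothetical, and the addresses in the usage note are taken from documentation ranges.

```python
import ipaddress

WLAN = "wlan"  # frames received on the wireless interface
WAN = "wan"    # packets received on the external interface


def ingress_allows(src_ip, in_iface, wlan_net):
    """Accept a packet only if its source address matches its ingress interface."""
    inside = ipaddress.ip_address(src_ip) in ipaddress.ip_network(wlan_net)
    if in_iface == WLAN:
        return inside       # wireless traffic must carry a local source address
    if in_iface == WAN:
        return not inside   # external traffic must not claim a local address
    raise ValueError("unknown interface: %r" % (in_iface,))
```

For example, a spoofed DNS response that claims to come from an external resolver at 192.0.2.53 but arrives on the wireless interface of a 192.168.1.0/24 network is rejected, whereas the same source address arriving on the external interface is accepted.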


5.2.2 External collaborator attack 


A variation of the spoofing attack that circumvents 
ingress filtering involves the use of an external
collaborator. In this variation of the attack, the attacker
is again eavesdropping on the wireless LAN, lurking for
DNS requests, but instead of sending the spoofed response
from the wireless LAN, the attacker signals another host
on the Internet to send a spoofed response to the victim.
Being able to
eavesdrop is crucial, as it allows the attacker to relay the 
needed DNS identifier and port number information to 
the remote collaborator. 

There are two constraints for the attacker that make 
this attack more difficult. First, the remote collabora- 
tor needs to be able to send packets with the source IP 
spoofed. Unfortunately, a recent study [18] shows that 
spoofing is still possible on more than 30% of hosts due 
to the limited use of source filtering. Second, the remote
collaborator needs to deliver the spoofed DNS response
before the legitimate DNS response arrives. Thus, the
attacker would need to locate a collaborator that is close
to the victim in terms of round-trip time, so that its
response wins the race against the legitimate resolver.


5.2.3 Packet rewriting defense


One way to defend against this attack is to rewrite pack- 
ets as they flow through the AP to the outside world, 
mapping the DNS id and port number, TCP sequence 
numbers, etc., to a different space, then doing the cor- 
responding inverse mapping on packets on the way back. 
The eavesdropper only knows the internal representa- 
tion of those identifiers and cannot relay the necessary 
information to the external collaborator. Any spoofed 
response from the external collaborator will be trans- 
formed to have an identifier that will result in the re- 
sponse getting dropped by the victim, making the attack 
ineffective. 

The mapping can be done using either a hash func- 
tion, or a state table, and is robust as long as the map- 
ping is unpredictable. In the case of hashing, we need to 
use a keyed hash, with the key being the destination IP 
address, to prevent the attacker from using a third-party 
DNS server to map out the key space. The choice be- 
tween state table and hash function is not always clear, as 
it involves space-time tradeoffs. If the hardware provides 
cheap hashing, then it may be preferred. In our Linksys 
OpenWRT implementation the use of a state table was 
more efficient as hashing introduced a high per-packet 
cost that turned the technique into a bottleneck. 
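
A state-table variant of the rewriting defense might look like the sketch below. It covers only the DNS identifier and UDP source port, uses hypothetical class and method names, and simplifies one point: a response whose external identifiers have no table entry is dropped at the AP instead of being forwarded with a mismatching identifier.

```python
import secrets


class DnsIdRewriter:
    """State table mapping unpredictable external (id, port) pairs to internal ones."""

    def __init__(self):
        self._table = {}  # (ext_id, ext_port) -> (client_ip, dns_id, src_port)

    def outbound(self, client_ip, dns_id, src_port):
        """Rewrite a request leaving the WLAN; return the external (id, port)."""
        while True:
            ext = (secrets.randbelow(1 << 16),
                   1024 + secrets.randbelow((1 << 16) - 1024))
            if ext not in self._table:
                break
        self._table[ext] = (client_ip, dns_id, src_port)
        return ext

    def inbound(self, ext_id, ext_port):
        """Map a response back to its internal identifiers, or return None to drop it."""
        return self._table.pop((ext_id, ext_port), None)
```

An eavesdropper on the WLAN observes only the internal identifier and port, so a collaborator replaying those values from outside does not match any table entry.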


5.2.4 802.11-level spoofing attack 


As discussed in the context of wifi IPS, a sophisticated 
attacker can circumvent the ingress filtering defense by 
violating the 802.11 protocol to transmit frames directly 
to the victim. The AP can detect this by monitoring 
for transmissions that it did not send. However, it can- 
not detect the whisper attack discussed earlier, where the 
attacker tunes the wifi radio power so that the spoofed 
packets can reach the victim, but cannot reach the AP (or 
external detection device). 

When filtering fails, the next best option is to detect 
and forcibly abort the attack. We pursue this direction in 
the next section.


5.2.5 Whisper attack detection 


We have developed a set of defenses based on the detec- 
tion of abnormal combinations of network events. For 
example, to detect the injection of a DNS reply, we use 
bookkeeping of request-reply pairs to flag excess,
inconsistent replies. We also raise an alert when a host
appears to retransmit requests after having received replies,
so we can prevent a situation where the attacker keeps
inserting a fake request, just before the legitimate reply 
for the previous request arrives, in order to maintain the 
request-response balance. 
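
The request-reply bookkeeping can be sketched as follows. The keying by (client, DNS identifier, name), the alert labels and the class name are assumptions made for illustration; timeouts and table eviction are omitted.

```python
class DnsBookkeeper:
    """Flags excess or inconsistent DNS replies and retransmissions after a reply."""

    def __init__(self):
        self._pending = set()  # requests still waiting for a reply
        self._answered = {}    # key -> set of answers already seen

    def on_request(self, client, dns_id, name):
        key = (client, dns_id, name)
        alerts = []
        if key in self._answered:
            # The same request reappears although a reply was already seen:
            # possibly an attacker restoring the request-response balance.
            alerts.append("retransmit-after-reply")
        self._pending.add(key)
        return alerts

    def on_reply(self, client, dns_id, name, answer):
        key = (client, dns_id, name)
        alerts = []
        if key in self._pending:
            self._pending.discard(key)
        else:
            alerts.append("excess-reply")
            previous = self._answered.get(key)
            if previous and answer not in previous:
                alerts.append("inconsistent-reply")
        self._answered.setdefault(key, set()).add(answer)
        return alerts
```

A retransmission that occurs before any reply has been seen raises no alert, so ordinary timeouts on a lossy wireless link are not flagged.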


While there are no visible duplicate replies in case of 
a whisper attack, the AP may still detect the attack indi- 
rectly. A solution for HTTP is to extract from each HTTP 
connection the server hostname from the corresponding 
mandatory HTTP header and the server address from the 
IP header, and compare this pair against the hostname 
and IP pairs extracted from observed DNS replies. If a 
reply has been whispered, no DNS reply will match the 
HTTP header and the attack will be detected. 
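
The HTTP consistency check can be illustrated in the same style. The sketch assumes that the AP has already parsed the hostname and addresses out of observed DNS replies and the Host header out of each HTTP request; the names are hypothetical, and bracketed IPv6 literals in the Host header are not handled.

```python
class DnsHttpConsistency:
    """Checks each HTTP (Host header, server IP) pair against observed DNS replies."""

    def __init__(self):
        self._seen = set()  # (hostname, address) pairs from observed DNS replies

    @staticmethod
    def _normalize(hostname):
        return hostname.lower().rstrip(".")

    def on_dns_reply(self, hostname, addresses):
        for address in addresses:
            self._seen.add((self._normalize(hostname), address))

    def http_is_consistent(self, host_header, server_ip):
        # Drop an optional ":port" suffix from the Host header value.
        host = self._normalize(host_header.split(":")[0])
        return (host, server_ip) in self._seen
```

If the reply was whispered, the AP never recorded the pair that the victim is now using, so the lookup fails and the connection can be flagged.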


We have evaluated our technique for detecting whis- 
per attacks against 41,426 DNS and 339,317 HTTP re- 
quests generated by 65 IP addresses over a period of a 
week. We obtained 18 alerts (6 unique web sites), all 
of them false, corresponding to a false positive rate of 
0.53 × 10^-4 of all HTTP requests. For the same trace
we observed zero excess DNS replies. We further
evaluated only our first technique for detecting excess DNS
replies against 43,272,448 DNS requests obtained over
a period of more than 1 month by instrumenting an
enterprise network with about 400 users. We obtained 22
alerts, all of them false, corresponding to a false
positive rate of 0.5 × 10^-6. Looking deeper into the alerts
revealed a Content Delivery Network that is employing 
spoofing, probably for server selection. 


Once an attack is detected, it has to be blocked. How- 
ever, given two inconsistent DNS responses, the detector 
cannot directly distinguish which one is legitimate and 
which one is spoofed. Doing a secondary lookup is one 
option, but the wide use of load balancing, particularly 
for popular services, implies that the secondary lookup 
may not always agree with one of the two inconsistent re- 
sponses. A more relaxed check against the network pre- 
fix is also unlikely to help in the general case, as server 
replicas may not be co-located. Lacking any other
satisfactory solution, our current implementation blocks the
victim’s request and, through temporary HTTP redirection,
sends the victim to a warning page that gives notice
of a potential spoofing attack and offers the option
to proceed (and re-issue the request).


6 Related work 


With the growing popularity of mobile devices, malware
targeting wireless environments has started to
emerge [27, 29]. This new security challenge has
recently gained some attention from the research
community.

A study related to ours is the one by Tsow et al. [55].
The authors suggest that attackers could drive around a 
city taking over vulnerable wireless home routers. Sim- 
ilar to our study, the threat is amplified by dense wifi 
deployment, as attackers can take over hosts at a higher 
rate. However, the attack depends on vulnerable access 
points, and requires the physical presence of the attacker 
for driving around to find vulnerable routers. The attacks 
we discuss in this paper can all be launched remotely, 
and are therefore easier and less risky for the attacker.

Anderson et al. [14] analyzed the speed of worm con- 
tagion over campus-wide wireless networks. They de- 
veloped a worm simulation using real data from Craw- 
dad, e.g. user distribution, AP distribution and user mo- 
bility, to realistically study the dynamics of a mobile 
worm. However, their results are confined to the dynamics
of a mobile worm at the relatively small scale of a
university campus, with mobility as the major factor for
worm spread. In contrast, our work investigates big cities
and metropolitan areas at a much larger scale, using
wardriving data. We have identified a much larger threat,
e.g., infection completing in the order of minutes, whereas
Anderson et al. [14] predict a few hours to infect just
the campus. The main difference is that wildfire-like
propagation, not just user mobility, is the key attack
vector in our work. It is also unclear whether their
defense proposals would remain effective given recent
major changes in wifi usage patterns.

Beyah et al. [19] discuss a worm that spreads by in- 
fecting users sharing the same hotspot. They use epi- 
demic models to simulate its spread and find it can in- 
fect a million users worldwide over the course of a year. 
Again the main difference is that the simulated worm re- 
lies on user mobility, but we show using wardriving data 
that mere density is sufficient in metropolitan areas lead- 
ing to much faster spread. 

Su et al. [52] investigate worm infections in a
Bluetooth environment. They expect Bluetooth devices to
outnumber wifi devices by a factor of 5 and predict
large-scale epidemics, but the short range of Bluetooth again
implies slower, mobility-based spread. Cole et al. [25] use
epidemic models and simulations to discuss requirements
for worm mitigation in tactical battlefield MANETs.

Stamm et al. [49] discuss remote attacks on routers 
that can be used for large-scale pharming and can also 
spread virally. We, too, discuss pharming as one of the
potential abuses of dense, weak wifi deployments — ex- 
ploitable in a different way but to a similar extent. 

Mickens and Noble [43] propose a framework called
probabilistic queuing to model epidemic spreading in
mobile environments, treating node mobility as a
first-class concern. Their simulations show that the
probabilistic queuing model can achieve more accurate
predictions than the standard Kephart-White framework in
many cases. However, this work assumes a random waypoint
model for user movement and does not take into account
realistic user mobility patterns.


Henderson et al. [33] analyzed extensive network 
traces from mature corporate WLANs and various
university campuses and observed dramatic changes in
wireless usage. All of these changes are favorable for the
spread of a wifi worm. First, users now run a wide variety
of applications such as peer-to-peer, multimedia and
VoIP services, rather than predominantly web traffic, so
there is a higher chance of a worm-exploitable
vulnerability. Second, local traffic in the WLAN exceeds
remote traffic, i.e., users within the same organization
exchange data more than before. This would help the worm
to detect and probe all wireless neighbors within its
reach. Third, the study shows that wireless users are
surprisingly non-mobile: half of them remain at home for
98% of the time.


In a similar approach, Hsu and Helmy [34] found that
wireless users exhibit association preferences: most
users visit only a small portion of access points, and
the ratio of visited access points hardly changes even
as the popularity of WLANs increases over the years,
which makes it an invariant user characteristic. There
is also a repetitive pattern of user association over
days, i.e., there is a high probability that a user
reappears at the same access point at a certain time
every day. This is quantified as the “network similarity
index”. A mobile worm could therefore distinguish itself
from a traditional Internet worm by self-activating at
the time when most mobile users are active. These
findings also contradict the general, over-simplified
assumption that users are always on with no preference
in association patterns; conventional randomly generated
synthetic mobility models are insufficient. Another
recent trend is that a mobile node stays online for
87.68% of its life on average (i.e., its existence in
the wireless network). That is to say, people now tend
to use the WLAN as a replacement for the wired network
and keep their laptops constantly connected (instead of
the old style of connecting only when needed): a shift
from the WLAN as a temporary connection to an always-on
permanent connection. In terms of macro mobility, users
have small coverage in all environments (campus and
corporate): they typically associate with only 1.1% to
4.52% of the total APs in their corporation, and each
user has very few APs where it spends most of its online
time.

Blinn et al. [21] monitored five weeks of activity on the
Verizon wifi hotspot network in Manhattan. They observed
that far more cards associated with the network than
logged into it. Most clients used the network infrequently
and visited only a few APs. Hotspots are therefore
“locations visited occasionally” rather than “primary
places of work”.


Kim et al. [37] extracted a mobility model from real
user traces. Speed and pause time follow a log-normal
distribution, and the direction of movement is closely
related to road directions. Again, most laptop clients are
not very mobile, so the authors relied on VoIP users to
extract the mobility model. The type of mobile device
being used can influence its user’s mobility: a laptop
would tend to tie its user to his workplace, whereas a
PDA/VoIP user would move as he would normally. The reasons
could be the weight, size and nature of use of the device.
A mobility model for laptop users should reflect the
relative weight of immobility and mobility.


Staniford et al. [51] describe the risk to the Internet 
due to the ability of attackers to quickly gain control of 
vast numbers of hosts. They argue that controlling a mil- 
lion hosts can have catastrophic results because of the po- 
tential to launch distributed denial of service (DDoS) at- 
tacks and access any sensitive information that is present 
on those hosts. Their analysis shows how quickly attack- 
ers can compromise hosts using “dumb” worms and how 
“better” worms can spread even faster. In subsequent 
work [50], the same authors show how a worm using pre- 
compiled lists of IP addresses known to be vulnerable 
can infect one million hosts in half a second. They also 
envision a Cyber “Center for Disease Control” (CCDC) 
for identifying outbreaks, rapidly analyzing pathogens, 
fighting the infection, and proactively devising methods 
of detecting and resisting future attacks. The metropoli- 
tan wifi environment offers another opportunity for at- 
tacks to occur that may not be covered by defenses built 
for Internet worms. Our work also provides estimates of 
propagation speed similar to the above studies. 


The issue of location privacy in a wireless setting has
been examined in the literature [35, 16, 31]. These systems
focus on protecting physical location privacy against
signal triangulation techniques and on protecting source
location in sensor networks. More closely related to our
work is the work by Gruteser et al. [32]. The authors
introduce the idea of short-lived disposable MAC
addresses as a technique for the reduction
of the effectiveness of location tracking. Our work shows 
that even in the presence of such techniques, user profil- 
ing can effectively track users in dense urban environ- 
ments. Saponas et al. [47] describe a prototype surveil- 
lance system that can track people wearing the widely 
available Nike+iPod sensors. Tracknets could be ex- 
ploited in similar scenarios to track people carrying any 
type of device whose traffic can be observed by wifi re- 
ceivers, such as wifi-enabled smart-phones. 
7 Concluding remarks


The increasing use of wireless technology and particu- 
larly wifi is likely to soon attract the attention of attack- 
ers, as attackers evolve and explore ways to exploit new 
technology to their advantage. This paper discusses a 
range of “modern” threats specifically tailored to metro- 
area wireless networks: wildfire worms that spread topo- 
logically due to infected hosts being able to carry the 
worm from one wireless LAN to another; large-scale 
wireless spoofing attacks that can be highly effective for 
phishing and spam campaigns; and malicious Tracknets 
that profile and track the whereabouts of wifi users. Such 
threats are greatly amplified by the increasingly dense 
deployment of wifi Access Points, and by the limited use 
of wireless security mechanisms such as 802.11i. Our re-
sults suggest that the density of large metropolitan areas
has a profound impact on the severity of the threat. 

Some specific contributions of this work include the 
modeling of fast, proximity-based worm propagation in 
metropolitan areas using real data from wardriving maps, 
wifi worm propagation using browser vulnerabilities, 
retrofitting of reactive mechanisms for wireless worm 
detection, spoofing defenses that are easy to implement, 
discussion of the whisper attack and defenses, and using 
RSS feeds to track users. 

Our primary intention with this study is to raise aware- 
ness on the threats of wireless networks, specifically in 
densely populated areas, and to explore possible counter- 
measures. Much of the problem lies in the limited use of 
802.11i. The wider deployment of 802.11i would reduce
the risks significantly, but it would not completely elim- 
inate them. More specifically, it would counter several 
instances of the spoofing threat; but it would only slow 
down, rather than mitigate wildfire worms; and it would 
not by itself eliminate the Tracknet threat, as MAC ad- 
dresses remain unencrypted in 802.11i and other means 
of profiling may be possible. 

Perhaps one of the main reasons behind the limited 
adoption of 802.11i is poor usability, as it involves con- 
figuration, and, once again, burdening users with yet 
another set of passwords or keys. Wider adoption re- 
quires convincing users that the extra trouble is worth 
it, by raising awareness on the risks of keeping wireless 
LANs open and unencrypted. We hope that our study 
contributes to this cause. 

Improving usability of wireless security standards, if 
feasible, is another path to improving adoption, but until 
such adoption is achieved and to counter the remaining 
threats, we have also suggested a variety of countermea- 
sures, which we have implemented and evaluated exper- 
imentally. Users may want to guard themselves against 
threats such as those described here, without having to
incur the cost of closing down their network using 802.11i
or WEP.
Abstract


Anonymization of network traces is widely viewed as a 
necessary condition for releasing such data for research 
purposes. For obvious privacy reasons, an important goal 
of trace anonymization is to suppress the recovery of 
web browsing activities. While several studies have ex- 
amined the possibility of reconstructing web browsing 
activities from anonymized packet-level traces, we ar- 
gue that these approaches fail to account for a number 
of challenges inherent in real-world network traffic, and 
more so, are unlikely to be successful on coarser Net- 
Flow logs. By contrast, we develop new approaches that 
identify target web pages within anonymized NetFlow 
data, and address many real-world challenges, such as 
browser caching and session parsing. We evaluate the 
effectiveness of our techniques in identifying front pages 
from the 50 most popular web sites on the Internet (as 
ranked by alexa.com), in both a closed-world experiment 
similar to that of earlier work and in tests with real net- 
work flow logs. Our results show that certain types of 
web pages with unique and complex structure remain 
identifiable despite the use of state-of-the-art anonymiza- 
tion techniques. The concerns raised herein pose a threat 
to web browsing privacy insofar as the attacker can ap- 
proximate the web browsing conditions represented in 
the flow logs. 


1 Introduction 


Recently, significant emphasis has been placed on the 
creation of anonymization systems to maintain the pri- 
vacy of network data while simultaneously allowing 
the data to be published to the research community 
at large [23, 24, 17, 9, 22]. In general, the goals 
of anonymization are (i) to hide structural information 
about the network on which the trace is collected, so that 
disclosing the anonymized trace does not reveal private 
information about the security posture of that network, 
and (ii) to prevent the assembly of behavioral profiles for
users on that network, such as the web sites they browse. 


Our goal in this paper is to evaluate the strength of 
current anonymization methodology in achieving goal 
(ii). Specifically, we focus on providing a realistic as- 
sessment of the feasibility of identifying individual web 
pages within anonymized NetFlow logs [4]. Our work 
distinguishes itself from prior work by operating on flow- 
level data rather than packet traces, and by carefully 
examining many of the practical concerns associated 
with implementing such identification within real net- 
work data. Previous work has focused on methods for 
web page identification within encrypted or anonymized 
packet trace data utilizing various packet-level features, 
such as size information, which cannot be readily scaled 
to flow-level data. Rather than assume the presence of 
packet-level information, our work instead focuses on the 
use of flow-level data from NetFlow logs to perform sim- 
ilar identification. Since NetFlow data contains a small 
subset of the features provided in packet traces, we are 
able to provide a general method for identifying web 
pages within both packet trace and NetFlow data. Also, 
use of NetFlow data is becoming more commonplace in 
network and security research [13, 21, 33, 5]. 


More importantly, our primary contribution is a rigor- 
ous experimental evaluation of the threat that web page 
identification poses to anonymized data. Though pre- 
vious work has provided evidence that such identifica- 
tion is a threat, these evaluations do not take into ac- 
count several significant issues (e.g., dynamic web pages, 
browser caching, web session parsing, HTTP pipelining) 
involved with the application of deanonymizing tech- 
niques in practice. To overcome these obstacles to practi- 
cal identification of web pages, we apply machine learn- 
ing techniques to accommodate variations in web page 
download behavior. Furthermore, our techniques can
parse and identify web pages even within multiple inter- 
leaved flows, such as those created by tabbed browsing, 
with no additional information. The crux of our identifi-
cation method lies in modeling the web servers which
participate in the download of a web page, and using 
those models to find the corresponding servers within 
anonymized NetFlow data. Since the behavior of each 
server, in terms of the flows they serve, is so dynamic, 
we apply kernel density estimation techniques to build 
models that allow for appropriate variations in behavior. 

Simply finding web servers is not enough to accurately 
identify web pages, however. Information such as the or- 
der in which the servers are contacted, and which servers 
are present can have significant impact on the identifi- 
cation of web pages. In fact, the ordering and presence 
of these servers may change based on various download 
scenarios, such as changes in browser cache or dynamic 
web page content. To capture these behaviors, we for- 
malize the game of “20 Questions” as a binary Bayes be- 
lief network, wherein questions are asked to narrow the 
possible download scenarios that could explain the pres- 
ence of a web page within the anonymized data. As such, 
our approach to web page identification begins with iden- 
tifying likely servers and then employs the binary Bayes 
belief network to determine if those servers appropriately 
explain the presence of the targeted web page within the 
data. 

Lastly, the evaluation of our techniques attempts to 
juxtapose the assumptions of closed world scenarios 
used in previous work with the realities of identifying web
pages in live network data. The closed world evalua- 
tion of data collected through automated browsing scripts 
within a controlled environment was found to perform 
well — detecting approximately 50% of the targeted web 
pages with less than 0.2% false detections. In more re- 
alistic scenarios, however, true detection and false de- 
tection rates varied substantially based upon the type of 
web page being identified. Our evaluation of data taken 
through controlled experiments and live network cap- 
tures shows that certain types of web pages are easily 
identifiable in real network data, while others maintain 
anonymity due to false detections or poor true detec- 
tion rates. Additionally, we show the effects of locality 
(i.e., different networks for collecting training and test- 
ing data) on the detection of web pages by examining 
three distinct datasets taken from disparate network en- 
vironments. In general, our results show that information 
leakage from anonymized flow logs poses a threat to web 
browsing privacy insofar as an attacker is able to approx- 
imate the basic browser settings and network conditions 
under which the pages were originally downloaded. 


2 Background and Related Work 


Network trace anonymization is an active area of re- 
search in the security community, as evidenced by 
the ongoing development of anonymization methods
(e.g., [9, 23, 30]) and releases of network data that they
enable (e.g., [26, 7]). Recently, several attacks have been 
developed that illustrate weaknesses in the privacy af- 
forded by these anonymization techniques. In particu- 
lar, both passive [6] and active attacks [2, 3] have shown 
that deanonymization of public servers and recovery of 
network topology information is possible in some cases. 
Until now, however, an in-depth examination of the ex- 
tent to which the privacy of web browsing activities may 
also be at risk has been absent. 

It would appear that existing approaches for infer-
ring web browsing activities within encrypted tunnels
([19, 32, 11, 1, 18, 8]) would be directly applicable to the
case of anonymized network data—in both cases, pay- 
load and identifying information (e.g., IP addresses) for 
web sites are obfuscated or otherwise removed. These 
prior works, however, assume some method for unam- 
biguously identifying the connections that constitute a 
web page retrieval. Unfortunately, as we show later, this 
assumption substantially underestimates the difficulty of 
the problem as it is often nontrivial to unambiguously 
delineate the flows that constitute a single page retrieval. 
The use of NetFlow data exacerbates this problem. Fur- 
thermore, as we show later, there are several challenges 
associated with the modern web environment that exac-
erbate the problem of web page identification under re-
alistic scenarios.

To our knowledge, Koukis et al. [14] present the 
only study of web browsing behavior inference within 
anonymized packet traces, which anticipates some of 
the challenges outlined herein. In their work, however, 
the authors address the challenges of parsing web page 
downloads from packet traces by using packet inter- 
arrival times to delineate complete sessions. Though this 
delineation can be successful in certain instances, there 
are several cases where time-based delineation alone will 
not work (e.g., for interleaved browsing). In this paper,
we address several challenges beyond those considered 
by Koukis et al. and provide a more in-depth evalua- 
tion that goes further than their exploratory work. More- 
over, our work differs from all prior work on this problem 
(of which we are aware) in that it applies to flow traces, 
which offer far coarser information than packet traces. 


3 Identifying Web Pages in Anonymized 
NetFlow Logs 


The anonymized NetFlow data we consider consists of 
a time-ordered sequence of records, where each record 
summarizes the packets sent from the server to the 
client within a TCP connection. These unidirectional 
flow records contain the source (server) and destination 
(client) IP addresses, the source and destination port 
numbers, timestamps that describe the start and end of
each TCP connection, and the total size of the traffic 
sent from the source to the destination in the flow (in 
bytes). The NetFlow format also contains a number of 
other fields that are not utilized in this work. For our pur- 
poses, we assume that the anonymization of the NetFlow 
log creates consistent pseudonyms, such as those created 
by prefix-preserving anonymization schemes [9, 23], for 
both the source and destination IP addresses in these 
records. Furthermore, we assume that the NetFlow data 
faithfully records TCP traffic in its entirety. 

The use of consistent pseudonym addresses allows us 
to separate the connections initiated from different hosts, 
thereby facilitating per host examination. Additionally, 
we assume that port numbers and sizing information are 
not obfuscated or otherwise altered to take on inconsis- 
tent values since such information is of substantial value 
for networking research (e.g., [10, 29, 12]). The unal- 
tered port numbers within the flows allow us to filter the 
flow records such that only those flows originating from 
port 80 are examined.
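
As a rough illustration of the record layout and port filter described above, the following Python sketch keeps only flows whose source port is 80 and groups them per client pseudonym. The class name FlowRecord, its field names, and the helper web_flows_by_client are hypothetical names chosen for this example; they are not part of the NetFlow format or of any tool used in this work.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One unidirectional (server to client) flow; field names are illustrative."""
    src_ip: str    # anonymized server pseudonym
    dst_ip: str    # anonymized client pseudonym
    src_port: int
    dst_port: int
    start: float   # start of the TCP connection (seconds)
    end: float     # end of the TCP connection (seconds)
    size: int      # bytes sent from source to destination


def web_flows_by_client(records):
    """Keep flows originating from port 80 and group them per client, time-ordered."""
    per_client = defaultdict(list)
    for rec in records:
        if rec.src_port == 80:
            per_client[rec.dst_ip].append(rec)
    for flows in per_client.values():
        flows.sort(key=lambda r: r.start)
    return dict(per_client)


# Made-up example records, for illustration only.
records = [
    FlowRecord("S1", "C1", 80, 40001, 0.0, 0.4, 5200),
    FlowRecord("S2", "C1", 443, 40002, 0.1, 0.3, 900),
    FlowRecord("S3", "C2", 80, 40003, 0.2, 0.9, 12000),
    FlowRecord("S1", "C1", 80, 40004, 0.5, 0.8, 3100),
]
grouped = web_flows_by_client(records)
```

Grouping by the destination pseudonym relies on the assumption stated above that anonymization produces consistent pseudonyms for each address.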

Initially, we also assume that web browsing sessions 
(i.e., all flows that make up the complete download of a 
web page) can be adequately parsed from the NetFlow 
log. A similar assumption is made by Sun et al. [32] 
and Liberatore et al. [18]. Though previous work has 
assumed that web browsing session parsing algorithms 
are available, accurate web session parsing is, in fact, 
difficult even with packet traces and access to payload 
information [31, 15]. In §6, we return to the difficulty
of parsing these sessions from real anonymized network 
data. By adopting the assumption (for now) that accurate 
web browsing session parsing can be done, it becomes 
possible to parse the complete NetFlow data into non- 
overlapping subsequences of flow records, where each 
subsequence represents a single, complete web brows- 
ing session for a client. Given the subsequent client web 
browsing sessions, our goal is to extract features that 
uniquely identify the presence of target web pages within 
the anonymized NetFlow data, and model their behavior 
in a manner that captures realistic browsing constraints. 


3.1 Feature Selection 


The most intuitive feature for discovering web pages in 
the anonymized NetFlow data is the sequence of flow 
sizes observed during a complete web browsing session. 
Each flow in the web browsing session is represented by 
an index number indicating its ordering in the session, 
and an associated flow size indicating the amount of data 
transferred during the flow. Naively, one would expect 
that the use of (flow size, index) pairs would suffice as a
good distinguisher for web page identification. However, 
as Figure 1(a) shows, this is not the case. For instance,
notice that the front page of msn.com is fairly inconsis-
tent in the number and size of flows, and there is a sig-
nificant amount of overlap even among only these three
examples. Since we are examining flows, the number 
of flows and their associated sizes are dependent on the 
manner in which the client requests objects, such as pic- 
tures or text. In many cases, the sequence in which the 
objects are downloaded may change due to dynamic web 
content, or the state of the client’s browser cache may 
cause certain objects to be excluded. These changes to 
the client’s download behavior cause object drift within 
the flows, where web page objects are downloaded in 
different flows or not downloaded at all. As a result, 
the number of flows and their respective sizes can vary 
widely, and are therefore a poor indicator of the identity 
of the web page in question. 

An important observation regarding this inconsistency 
is that the size of any flow is regulated by the cumula- 
tive size of all the objects downloaded for the web page, 
less the size of all objects downloaded in prior flows. 
If a large flow early in the browsing session retrieves a 
significant number of objects, then the subsequent flows 
must necessarily become smaller, or there must be fewer 
flows overall. Conversely, a session of many small flows 
must necessarily require more flows overall. In fact, 
if we examine the cumulative perspective of web page 
downloads in Figure 1(b), we find that not only are these 
sites distinguishable, but that they take consistent paths 
toward their target cumulative size. 

The existence of such paths and the inherent connec- 
tion between flow size, index number, and cumulative 
size motivates the use of all three features in identify- 
ing web pages. These three features can be plotted in 3- 
dimensional space, as shown in Figure 2(a), and the path 
taken in this 3-dimensional space indicates the behavior 
exhibited by the download of objects for a complete web 
browsing session. Figure 2(b) shows an example of web 
browsing session paths for the front pages of both ya- 
hoo.com and msn.com overlaid on the set of points taken 
over many web browsing sessions of msn.com’s front 
page. Clearly, the path taken by yahoo.com is distinct 
from the set of points generated from web browsing ses- 
sions of msn.com, while the msn.com path remains simi- 
lar to past web browsing sessions. 
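
To make the three features concrete, a minimal sketch of turning an ordered list of flow sizes into (index, flow size, cumulative size) triples is shown below. The function name session_path and the choice to start indices at 0 are assumptions made for this example.

```python
from itertools import accumulate


def session_path(flow_sizes):
    """Return (index, flow size, cumulative size) triples for one session."""
    cumulative = accumulate(flow_sizes)
    return [
        (index, size, total)
        for index, (size, total) in enumerate(zip(flow_sizes, cumulative))
    ]


# Made-up flow sizes in bytes, for illustration only.
path = session_path([5200, 3100, 700])
```

Each triple is one point in the 3-dimensional space, and the list as a whole is the path followed by the session.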


Server sessions The use of flow size, index number, 
and cumulative size information can be further refined 
by considering the sequence of flows created from each 
web server in the web browsing session, which we de- 
note as a server session. Notice that when we sepa- 
rate the flows for msn.com by the server that produced 
them, each server occupies a very distinct area of the 
3-dimensional space, as shown in Figure 3. This re- 
finement offers two benefits in identifying web pages. 
Figure 1: (a) Sequential and (b) cumulative views of page loads for msn.com, cnn.com, and ebay.com from a single client
Figure 2: (a) 3-D view of page loads for msn.com, cnn.com, and ebay.com from a single client; (b) Regions for
msn.com compared to sequences of yahoo.com and msn.com as downloaded by a single client


First, by abstracting the web browsing sessions to con- 
sist of individual server sessions, we can use the pres- 
ence or absence of servers and their relative ordering to 
further differentiate web pages. The ordering of these 
web servers provides useful information about the struc- 
ture of the web page since there is often a dependency 
between objects within the web page. For instance, the 
HTML of a web page must be downloaded before any 
other objects, and thus the first server contacted must be 
the primary web server. Second, by refining our flow 
information on a per-server basis, we can create a fine-
grained model of the behavior of the web browsing ses-
sion. If done correctly, the problem of identifying a web
page within anonymized NetFlow data can be reduced 
to one of identifying the servers present within a given 
web browsing session based on the path created by the 
flows they serve, and the order in which the servers are 
contacted. 
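
The refinement into server sessions can be sketched as follows, assuming a browsing session is given as a time-ordered list of (server pseudonym, flow size) pairs. The function name server_sessions and this input layout are hypothetical.

```python
def server_sessions(flows):
    """Split one browsing session into per-server paths.

    flows: time-ordered list of (server_pseudonym, flow_size) pairs.
    Returns (order, paths), where order lists servers by first contact and
    paths maps each server to its (index, flow size, cumulative size) triples.
    """
    order, paths, totals = [], {}, {}
    for server, size in flows:
        if server not in paths:
            order.append(server)
            paths[server] = []
            totals[server] = 0
        totals[server] += size
        paths[server].append((len(paths[server]), size, totals[server]))
    return order, paths


# Made-up session, for illustration only.
order, paths = server_sessions([("A", 100), ("B", 50), ("A", 30)])
```

The returned order captures which servers are present and the sequence in which they were first contacted, while each per-server path carries the fine-grained flow behavior.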


Logical servers Intuitively, we could simply use the 
flows served by each distinct web server IP address 
(which we refer to as a physical server) to create the 
3-dimensional space that describes the expected behav- 
ior of that physical server in the web browsing session. 
However, the widespread use of Content Delivery Net- 
works (CDNs) means that there may be hundreds of dis- 
tinct physical web servers that serve the same web ob- 
jects and play interchangeable roles in the web brows- 
ing session. These farms of physical servers can actually 
be considered to be a single logical server in terms of 
their behavior in the web browsing session. Therefore, 
the 3-dimensional models we build are derived from the 
samples observed from all physical servers in the logical 
server group. 


Of course, the creation of robust models for the detec- 
tion of web pages requires that the data used to create the 
models reflect realistic behaviors of the logical servers 
and the order in which they are contacted. There are a 
number of considerations which may affect the ability 
Figure 3: 3-D view of msn.com separated by server as observed by a single client


of data to accurately predict the behavior of web page 
downloads. These considerations are especially impor- 
tant when an attacker is unable to gain access to the 
same network where the data was collected, or when 
that data is several months old. Liberatore et al. have 
shown that the behavioral profiles of web pages, even 
highly dynamic web pages, remain relatively consistent 
even after several months [18], though the effects of web 
browser caching behavior and the location where the net- 
work data was captured have not yet been well under- 
stood. 


4 An Automated Classifier for Web Pages 
in NetFlows 


In this section, we address the problem of building auto- 
mated classifiers for detecting the presence of target web 
pages within anonymized NetFlow data. Through the use 
of features discussed in §3, we create a classifier for each
web page we wish to identify. The classifier for a tar-
get web page consists of (i) the 3-dimensional spaces for
each of its logical servers, which we formalize by using
kernel density estimates [28], and (ii) a series of con-
straints for those logical servers, formalized by a binary
Bayes belief network [20]. The goal of the classifier is to
attempt to create a mapping between the physical servers 
found in the anonymized web browsing session and the 
logical servers for the target web page, and then to use 
the mapping to evaluate constraints on logical servers for 
the web page in question. These constraints can include 
questions about the existence of logical servers within 
the web browsing session, and the order in which they 
are contacted by the client. If the mapping meets the 
constraints for the given web page, then we assume that 
the web page is present within the web browsing session; 
otherwise, we conclude it is not. 

There are several steps, illustrated in Figure 4, that 
must be performed on the anonymized NetFlow logs in 
order to accurately identify web pages within them. Our 
first step is to take the original NetFlow log and parse
the flow records it contains into a set of web browsing
sessions for each client in the log. Recall that our initial 
discussion assumes the existence of an efficient and ac- 
curate algorithm for parsing these web browsing sessions 
from anonymized NetFlow logs. These web browsing 
sessions, by definition, consist of one or more physical 
server sessions, which are trivially parsed by partitioning 
the flow records for each (client, server) pair into separate
physical server sessions. The physical server sessions 
represent the path taken within the 3-dimensional space 
(i.e., flow size, cumulative size, and index triples) when 
downloading objects from the given physical server. At 
this point, we take the path defined by each of the phys-
ical servers in our web browsing session, and determine
which of the logical servers in our classifier it is most
similar to by using kernel density estimates [28]. Therefore, a
given physical server is mapped to one or more logical 
servers based on its observed behavior. This mapping in- 
dicates which logical servers may be present within our 
web browsing session, and we can characterize the iden- 
tity of a web page by examining the order in which the 
logical servers were contacted using a binary belief net- 
work. If we can satisfy the constraints for our classifier 
based upon the logical servers present within the web 
browsing session, then we hypothesize that an instance 
of the web page has been found. In §4.1 and §4.2, we
discuss how the kernel density estimates and binary be- 
lief networks are created, respectively. 


4.1 Kernel Density Estimation 


In general, the kernel density estimate (KDE) [28] uses a
vector of samples, $S = \langle s_1, s_2, \ldots, s_n \rangle$, to derive an es-
timate for a density function describing the placement of
points in some d-dimensional space. To construct a KDE
for a set of samples, we place individual probability dis-
tributions, or kernels, centered at each sample point $s_i$.
In the case of Gaussian kernels, for instance, there would
be n Gaussian distributions with means of $s_1, s_2, \ldots, s_n$,
respectively. To control the area covered by each distri-
bution, we can vary the so-called bandwidth of the ker-
nel. For a Gaussian kernel, its bandwidth is given by the
variance (or covariance matrix) of the distribution. Intu-
itively, a higher bandwidth spreads the probability mass
out more evenly over a larger space.

Figure 4: General overview of our identification process

Unfortunately, determining the appropriate bandwidth 
for a given set of data points is an open problem. One ap- 
proach that we have found to produce acceptable results 
is to use a “rule of thumb” developed by Silverman [28] 
and refined by Scott [27]. The bandwidth is calculated 
as $H_a = N^{-1/(d+4)} \sigma_a$, for each dimension $a = 1, \ldots, d$,
where $N$ is the number of kernels, $d$ is the number of di-
mensions for each point, and $\sigma_a$ is the sample standard
deviation of the $a$-th dimension from the sample points in
S. The primary failing of this heuristic is its inability to 
provide flexibility for multi-modal or irregular distribu- 
tions. However, since this heuristic method provides ad- 
equate results for the problem at hand, we forego more 
complex solutions at this time. 
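
A minimal sketch of the bandwidth heuristic follows, assuming the rule of thumb takes Scott's familiar form $H_a = N^{-1/(d+4)} \sigma_a$ and that at least two sample points are available. The function name scott_bandwidths is hypothetical, and samples are assumed to be equal-length tuples of numbers.

```python
import math


def scott_bandwidths(samples):
    """Per-dimension bandwidth H_a = N^(-1/(d+4)) * sigma_a (needs N >= 2)."""
    n = len(samples)
    d = len(samples[0])
    factor = n ** (-1.0 / (d + 4))
    bandwidths = []
    for a in range(d):
        column = [s[a] for s in samples]
        mean = sum(column) / n
        # Sample variance with the usual (n - 1) denominator.
        variance = sum((x - mean) ** 2 for x in column) / (n - 1)
        bandwidths.append(factor * math.sqrt(variance))
    return bandwidths


# Made-up 2-dimensional samples, for illustration only.
h = scott_bandwidths([(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)])
```

A dimension with no spread yields a bandwidth of zero under this rule, which is one reason a lower bound on the bandwidth can be useful in practice.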

Once the distributions and their associated bandwidths
have been placed in the 3-dimensional space, we can cal-
culate the probability of a given point, $t_j$, under the KDE
model as:

$$P(t_j) = \frac{1}{n} \sum_{i=0}^{n-1} P_{s_i}(t_j) \qquad (1)$$

where $P_{s_i}(t_j)$ is the probability of point $t_j$ under the ker-
nel created from sample point $s_i$.
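
Eqn. 1 can be sketched in a few lines of Python. This example assumes axis-aligned Gaussian kernels and, as a simplification of our own, treats each bandwidth as the per-dimension standard deviation of the kernel; the function names are hypothetical.

```python
import math


def gaussian_kernel(point, center, bandwidths):
    """Density at `point` of an axis-aligned Gaussian centered at `center`.

    Assumption: each bandwidth is used as a per-dimension standard deviation.
    """
    density = 1.0
    for x, mu, h in zip(point, center, bandwidths):
        density *= math.exp(-0.5 * ((x - mu) / h) ** 2) / (h * math.sqrt(2 * math.pi))
    return density


def kde_probability(point, samples, bandwidths):
    """Eqn. 1: average of the kernel densities over all sample points."""
    return sum(gaussian_kernel(point, s, bandwidths) for s in samples) / len(samples)


# Made-up 1-dimensional samples, for illustration only.
p_near = kde_probability((0.0,), [(0.0,), (2.0,)], [1.0])
p_far = kde_probability((10.0,), [(0.0,), (2.0,)], [1.0])
```

Points close to the training samples receive a much larger value than points far from all of them, which is the property the classifier relies on.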


4.1.1 Application to Web Page Identification 


To apply our anonymized NetFlow data to a KDE model, 
we take the set of paths defined by the triples of flow size, 
cumulative size, and index in each of the physical server 
sessions of our training data and use them as the sample 
points for our kernels. We choose the Gaussian distribu- 
tion for our kernels because it allows us to easily eval- 
uate probabilities over multiple dimensions. The band- 
width of these distributions is calculated as described 
previously, except that for the flow size and cumulative 
size dimensions, we take the average standard deviation 
across all index values as the bandwidth. Furthermore, 
we bound the bandwidth in each dimension such that it
is always > 1 to allow for some minimum amount of
variability.

To evaluate an anonymized physical server session on
a particular KDE model, we simply evaluate each point
in the path for that physical server session using Eqn. 1,
and calculate the total probability of the given physical
server session, $t$, as:

$$P(t) = \prod_{j=0}^{m-1} P(t_j) \qquad (2)$$

where $t_j$ is the $j$-th point in the physical server session
$t$, and $m$ is the number of flows in $t$. For classifica-
tion, we consider any physical server session whose path
has a non-zero probability (from Eqn. 2) under the given 
model to be a mapping between the logical server repre- 
sented by the model and the physical server session be- 
ing evaluated. Of course, it may be possible for physical 
server session $t$ to follow a path that matches portions
of several disjoint paths in the KDE model without ex- 
actly matching any paths in their entirety. Consequently, 
the path would achieve a non-zero probability despite the 
fact that it is not similar to any of the paths in the model. 
To prevent such situations from occurring, we apply lin- 
ear interpolation to each pair of points representing con- 
secutive flow indices on a path to create sample points at 
half index intervals. 
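
The interpolation step and the product in Eqn. 2 can be sketched as below. The helper names are hypothetical, point_probability stands in for any per-point model such as Eqn. 1, and working in log space is a choice made for this example to avoid floating-point underflow when many small probabilities are multiplied.

```python
import math


def add_half_index_points(path):
    """Insert the midpoint between each pair of consecutive path points."""
    if not path:
        return []
    dense = [path[0]]
    for prev, cur in zip(path, path[1:]):
        mid = tuple((p + c) / 2 for p, c in zip(prev, cur))
        dense.extend([mid, cur])
    return dense


def path_log_probability(path, point_probability):
    """Log of Eqn. 2: sum of log P(t_j), or -inf if any point has zero probability."""
    total = 0.0
    for point in path:
        p = point_probability(point)
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


# Made-up path of (index, flow size, cumulative size) triples.
dense = add_half_index_points([(0, 100, 100), (1, 50, 150)])
```

A path whose log probability is finite corresponds to a non-zero value of Eqn. 2; an implementation would likely compare against a small threshold rather than exact zero.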

The use of path probabilities alone, however, is insuf- 
ficient in uniquely describing the behavior of the logical 
server. To see why, consider the case where we have 
a model for a logical server which typically contains 
ten or more total flows. It may be possible for a much 
smaller physical server session with one or two flows to 
achieve a non-zero probability despite the fact that there 
clearly was not an adequate amount of data transferred. 
To address this, we also create a KDE model for the fi- 
nal points in each sample path during training, denoted 
as end points. These end points indicate the requisite cu- 
mulative size and number of flows for a complete session 
with the given logical server. As before, we create distri- 
butions around each sample end point, and calculate the 
probability of the physical server session’s end point by
applying Eqn. 1. Any anonymized physical server ses-
sion which has a non-zero probability on both its path
and its end point for a given logical server model is
mapped to that logical server.
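
The two-part test can be sketched as follows, assuming paths are lists of (index, flow size, cumulative size) triples. The helper names and the threshold parameter are hypothetical; the default threshold of zero mirrors the non-zero criterion in the text.

```python
def end_point(path):
    """Final (index, cumulative size) of a path of (index, size, cumulative) triples."""
    index, _size, cumulative = path[-1]
    return (index, cumulative)


def maps_to_logical_server(path_probability, end_probability, threshold=0.0):
    """Accept the mapping only if both the path and its end point are plausible."""
    return path_probability > threshold and end_probability > threshold


# Made-up path, for illustration only.
final = end_point([(0, 100, 100), (1, 50, 150)])
```

Requiring both conditions rejects short sessions that follow the start of a model path but never transfer enough data to reach its typical end.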


Automatically Building Logical Server KDE Models 
To create our logical server models, we use two heuris- 
tics to group physical servers into logical server groups. 
First, if two physical servers in our training data use the 
same hostname and serve the exact same HTTP URL, 
we can assume they are the same logical server and their 
sample points can be merged into a single KDE model. 
Since we are in control of our training data, we can col- 
lect packet traces to find URL and hostname information 
before converting the data to NetFlow format to create 
the paths that will make up our KDE models. It is of- 
ten the case, however, that different hostnames are used 
among physical servers in the same logical server group, 
and this may prevent some of the physical servers in our 
training data from being placed into the correct logical 
server groups. 

To address this, we apply a second heuristic that 
merges these remaining physical servers by examining 
the behavior exhibited by their KDE models. If a ran- 
domly selected path and its end point from a given phys- 
ical server’s training data achieves non-zero probability 
on the KDE model of another physical server, then those 
two physical servers can be merged into a single logical 
server. The combination of these two heuristics allows us
to reliably create KDE models that represent the logical 
servers found in the web browsing session. By applying 
the points found in an anonymized physical server ses- 
sion to each of the KDE models for a given web page, 
we can create candidate mappings from the anonymized 
physical server to the logical servers for the target web
page.


4.2 Binary Bayes Belief Networks 


As discussed earlier, we formalize the constraints on 
the logical servers using a binary Bayes belief network 
(BBN). In a typical Bayes network, nodes represent 
events and are inter-connected by directed edges which 
depict causal relations. The likelihood that a given event 
occurs is given by the node’s probability, and is based on 
the conditional probability of its ancestors. In the binary 
Bayes belief network variant we apply here, we simply 
use a binary belief network where events have boolean 
values and the causal edges are derived from these val- 
ues [20]. 

An example of a binary belief network is given in Fig- 
ure 5, where the probability of event y is conditioned 
upon event ¬x. One way of thinking of this network is
as a strategy for the game of “20 Questions,” where the
player attempts to identify an object or person by asking
questions that can only be answered with “Yes” or “No”
responses. Our binary belief network is simply a formalization
of this concept (though we are not limited to asking
20 questions), where the answer to any question dictates
the best strategy for asking future questions.

[Figure 5 not reproduced: a binary tree whose levels are labeled is-present?, causal relationships, and relative sizes, with false/true branches.]

Figure 5: Example BBN


To create the belief network, we first decide upon a 
set of questions (or events) that we would like to evalu- 
ate within the data. In the context of web privacy, these 
events relate to the existence of logical servers, their 
causal relationships, and cumulative size. The belief net- 
work can be created automatically by first examining all 
possible existence and ordering events that occur within 
the training data. Next, from this set of events, we can 
simply select the event whose probability of being True 
in the training data is highest among all events. Having 
done so, the training data can then be partitioned into two 
groups: one group whose data has the value True for the 
selected event, and another whose value is False for 
that event. The selected event is then removed from the
set of possible events, and for each partition of the training
data we select another event from the remaining set whose
probability on that partition’s data is highest.


This partitioning process is repeated recursively, al- 
lowing each branch to grow independently. A given 
branch halts its recursion when its conditional probability
for an event is < ε. The conditional probability
threshold, ε, indicates the percentage of the training data
that remains at a given leaf node, and therefore we stop 
our recursion before the tree becomes overfitted to our 
training data. Any leaf node that halts recursion with 
some amount of training data remaining is considered as 
an accepting node, and all other leaf nodes are labeled 
as rejecting nodes. Accepting nodes implement one ad- 
ditional check to ensure that the total size of all flows 
in the web browsing session is within ±10% of the total
sizes observed during training. 
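
The recursive construction can be sketched as below. This is one reading of the procedure, not the original implementation: each training session is assumed to be a set of the event names that are True for it, a branch stops once the fraction of all training data reaching it falls below ε (or no events remain), and the final ±10% total-size check on accepting nodes is omitted. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    event: Optional[str] = None        # question asked here; None for a leaf
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None
    accepting: bool = False            # meaningful only for leaves


def build_bbn(sessions, events, total, eps):
    """Greedily grow the tree; `sessions` is a list of sets of True events."""
    if not sessions or not events or len(sessions) / total < eps:
        # A leaf accepts only if some training data reached it.
        return Node(accepting=bool(sessions))
    # Pick the event that is most often True on this partition
    # (sorted() makes tie-breaking deterministic).
    best = max(sorted(events), key=lambda e: sum(e in s for s in sessions))
    yes = [s for s in sessions if best in s]
    no = [s for s in sessions if best not in s]
    rest = events - {best}
    return Node(best,
                build_bbn(yes, rest, total, eps),
                build_bbn(no, rest, total, eps))


def classify(node, observed):
    """Walk the tree using the set of events observed in a test session."""
    while node.event is not None:
        node = node.if_true if node.event in observed else node.if_false
    return node.accepting
```

A session whose observed events follow a branch that held training data ends at an accepting leaf; any other combination of answers ends at a rejecting leaf.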


5 A Closed-world Evaluation


To gauge the threat posed by our web page identification
techniques, and to place our results in context with
prior work, we first provide an evaluation under a clean,
closed world testing model. Prior work on this topic also 
focuses on the evaluation of identification techniques 
based on a controlled network environment, browsing 
a set of target web pages across an encrypted tunnel 
[32], through a proxy server [18], or within anonymized 
packet traces [14]. Each of these works, with the exception
of [14], also assumes that the web browsing session
can be easily parsed from the stream of packets crossing 
the encrypted tunnel or proxy server. In what follows, we 
also adopt this assumption for this particular evaluation, 
though we will re-visit the inherent challenges with web 
browsing delineation in §6.

In short, our initial evaluation is considered under con- 
trolled environments similar to past work, but with two 
notable differences. First, in the scenarios we exam- 
ine, there is substantially less data available to us than 
at the packet trace level; recall that NetFlows aggre- 
gate all packets in a flow into a single record. Second, 
rather than assuming that the client’s browser cache is 
turned off, we attempt to simulate the use of caching in 
browsers in our training and testing data. The simula- 
tion of browser caching behavior was implemented by 
enabling the default caching and cookie policies within 
Mozilla Firefox™, and browsing to the sites in our target 
set at random. Of course, this method of cache simula- 
tion is not entirely realistic, as the probability of a cache 
hit is directly proportional to the frequency with which 
the user browses that web site. However, in lieu of mak- 
ing any assumptions on the distribution of web brows- 
ing for a given user, we argue that for the comparison at 
hand, the uniform random web browsing behavior pro- 
vides an adequate approximation. 


Data Collection The data for our closed world eval- 
uation was collected with the use of an automated 
script that used Firefox to randomly visit selected web 
pages from a set of target pages (with Adobe Flash and 
JavaScript enabled). The web browser was set to the default
caching and cookie policies to ensure the most realistic
behavior possible in such a closed world environment.
Specifically, the script first initiated a new Firefox
instance, and opened new tabs within the single Firefox
instance for each new web page visited. While these
web pages were not loaded in parallel, several web sites 
automatically refresh themselves at given intervals, thus 
adding noise to our data whenever they appeared among 
one of the tabs of the active Firefox instance. Once four 
web pages were opened in the current Firefox instance, 
the browser was closed gracefully to allow the cache to
flush to disk, and a new Firefox instance was loaded to
continue the random browsing. For each visit to a web
page, we captured the packets for that web browsing session
and recorded them in a separate trace. The packet captures
were then converted into NetFlow logs by creating
a single flow record for each TCP connection in the
session. Notice that the use of an automated browsing 
script allowed us to cleanly delineate between browsing 
sessions, as well as to simulate cache behavior through 
random browsing. 
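
The conversion from a packet capture to flow records can be sketched roughly as follows. The packet tuple layout and record field names are hypothetical, and the sketch ignores details a real converter would handle, such as port reuse and TCP connection teardown; it only shows the aggregation of all packets of a connection into one record.

```python
def packets_to_flows(packets):
    """Aggregate packets into one record per TCP connection.

    Each packet is assumed to be a tuple
    (timestamp, src_ip, src_port, dst_ip, dst_port, size_in_bytes).
    """
    flows = {}
    for ts, src, sport, dst, dport, size in packets:
        # Both directions of a connection share one key.
        key = tuple(sorted([(src, sport), (dst, dport)]))
        rec = flows.setdefault(
            key, {"start": ts, "end": ts, "packets": 0, "bytes": 0})
        rec["start"] = min(rec["start"], ts)
        rec["end"] = max(rec["end"], ts)
        rec["packets"] += 1
        rec["bytes"] += size
    # Order records by the time their first packet was seen.
    return sorted(flows.values(), key=lambda r: r["start"])
```

Only start and end times, packet counts, and byte counts survive this aggregation, which is why far less information is available than in the original packet trace.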

Our target pages were the front pages of the top 50 
most popular sites as ranked by alexa.com. We also
collected information about the front pages
of sites ranked 51-100 on the alexa.com list for use in
providing a robust evaluation of the false detection rates of
our technique. Though we have chosen to evaluate our
techniques on the top 100 sites, there is nothing inherent 
in their structure that differentiates them from other web 
pages. In fact, the same techniques are equally applica- 
ble in targeting any web page of the attacker’s choosing. 

The web pages were retrieved by running the automated
browsing script on a host within the Johns Hopkins
University network for a total of four weeks, creating
a total of 18,525 web browsing sessions across all
100 web pages in our list. From this data,
we select the first 90 web browsing sessions of each of our target
web pages (i.e., those within the top 50 of the alexa.com
ranking) as the training data for the creation of the ker- 
nel density estimate (see §4.1) and binary Bayes belief 
network models (see §4.2) that make up the profiles for 
each target web page. The remaining sessions are used 
as test data and are anonymized by replacing IP ad- 
dresses within the NetFlow data with prefix-preserving 
pseudonyms according to the techniques described by 
[23]. Notice that since we assume that the web browsing 
sessions are easily parsed, we can simply use each web 
browsing session in our test data directly to determine 
if that web browsing session can be identified as any of 
the 50 web pages in our target set using the techniques 
described in §3. 


Results The results for this evaluation are given in Ta- 
ble 1. The analysis shows that our web page identifi- 
cation method performs reasonably well in the closed 
world environment. Though the overall true detection 
rate is only 48%, its associated false detection rate is 
exceptionally low at only 0.18% across all web pages. 
For comparison, using random guessing to identify web 
pages would yield an overall true detection rate of only 
2%. Moreover, keep in mind that under the goals of net- 
work data anonymization, no inference of browsing be- 
havior should be possible. 

[Table 1 body: only the example web pages for each category row are legible.]

passport.net, statcounter.com
match.com, myspace.com
msn.com, google.com
imdb.com, wikipedia.org
flickr.com, youtube.com
microsoft.com, apple.com
amazon.com, ebay.com
cnn.com, nytimes.com
monster.com, careerbuilder.com
foxsports.net, mlb.com

Table 1: True and false detection rates for canonical categories in closed world test


For ease of exposition, we also partition the 50 target
web pages into canonical categories based on the primary
function of the web site. Notice that the performance of
the canonical classes varies based on the dynamism of 
the contents in the web page. For instance, some of the 
more difficult categories in terms of true detection are 
those whose front page content changes frequently, e.g.,
cnn.com. Conversely, pages with simple, static content, 
like passport.net or google.com, can be identified reason- 
ably well. Moreover, those web sites with simple lay- 
outs and little supporting infrastructure also tend to fare 
worst with respect to false detections, while complex, 
dynamic sites have few, if any, false detections. These 
initial results hint at the fact that the ability to reliably 
identify web pages is connected with the complexity and 
dynamism of the web page. In what follows, we examine 
whether these results hold under a more realistic exami- 
nation based on real world browsing. 


6 Considerations for the Real World 


The closed world evaluation in the previous section made 
several assumptions about the attacker’s ability to parse 
web sessions and simulate caching behavior. Moreover, 
since both the training and testing data were collected at 
the same location, the effects of locality on the effective- 
ness of the identification techniques were not accounted 
for. These assumptions lead to a disconnect between the 
results of our closed world testing and those that can be 
expected in realistic attacks on anonymized data. In or- 
der to perform a rigorous evaluation of the real threats 
posed by such identification techniques, we must address 
several issues, including web browsing session parsing 
and caching behavior. 


Web Browsing Session Parsing One of the biggest 
concerns with the closed world evaluation in §5 is that 
there is an implicit assumption that parsing web sessions 
from live network data is a simple and accurate task. 
There has been extensive work in attempting to parse
packet traces into web browsing sessions, yet much of
this work requires access to plaintext payloads, and re- 
sults show that this parsing is not completely accurate 
[16, 31, 15]. To our knowledge, there is no prior art 
on performing similar parsing on NetFlow data. Koukis 
et al. attempted to use a heuristic of packet inter-arrival 
times to delineate sessions in packet traces, but their 
techniques were only able to correctly identify 8% of 
the web browsing sessions—underscoring the difficulty 
of the problem [14]. 


Fortunately, our kernel density estimate (KDE) and bi- 
nary Bayes belief network (BBN) models can be modi- 
fied to overcome the challenges of web browsing ses- 
sion parsing without significant changes to our identifi- 
cation process. In our previous evaluation, we assumed 
that the KDE and BBN models were given a subsequence 
of the original NetFlow log that corresponded to a com- 
plete web browsing session for a single client. For our 
real world evaluation, however, we remove this assump- 
tion. Instead, to parse the NetFlow log, we assume that 
all flows of a given web browsing session are clustered 
in time, and partition the NetFlow log into subsequences 
such that the inter-arrival times of the flows in the partition
are < δ = 10 seconds. This assumption is similar to
that of Koukis et al. [14] and provides a coarse approx- 
imation to the web browsing sessions, but the resultant 
partitions may contain multiple web browsing sessions, 
or interleaved sessions. 
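
The time-based partitioning can be sketched in a few lines. The sketch assumes each flow is represented only by its start time in seconds and that a new partition begins whenever the gap to the previous flow reaches the 10-second threshold; the function name and the use of start times alone are our simplifications.

```python
def partition_flows(start_times, gap=10.0):
    """Split flow start times into partitions separated by gaps >= `gap` seconds."""
    partitions = []
    for t in sorted(start_times):
        if partitions and t - partitions[-1][-1] < gap:
            # Close enough to the previous flow: same partition.
            partitions[-1].append(t)
        else:
            partitions.append([t])
    return partitions
```

With start times 0, 3, 9, 25, 30, and 50, this produces three partitions: the first three flows, the next two, and the last flow alone. Nothing here prevents a partition from containing several interleaved web browsing sessions, which is why the mapping and BBN steps must tolerate extra physical servers.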


Notice that we can simply use each of the physical 
servers within these partitions as input to the KDE mod- 
els for a target web page to determine which logical 
servers may be present in the partition. Thus, we apply 
the flows from every physical server in our partitioned 
NetFlow data to the KDE models for our target page to 
create the logical server mappings. If a physical server 
in our partition does not map to a logical server, we ig- 
nore that physical server’s flows and remove it from the 
partition. Thus, by removing these unmapped physical 
servers, we identify a candidate web browsing session
for our target site. Since the BBN operates directly on 
the mappings created by the KDE models, we traverse 
the BBN and determine if the web page is present based 
on the physical servers that were properly mapped. This 
technique for finding web browsing sessions is particu- 
larly robust since it can find multiple web pages within a 
single partition, even if these web pages have been inter- 
leaved by tabbed browsing. 


Browser Cache Behavior Another serious concern in 
our closed world evaluation is the variability of web 
browsing session behavior due to the client’s browser 
cache. In our closed world evaluation, we created our 
models from data collected by an automated script that 
randomly browsed the front pages from among the top 
100 sites according to alexa.com. The use of uniform 
random browsing with the default cache policy, how- 
ever, does not accurately reflect the objects that would 
be cached by real clients. In reality, the client’s browser 
cache would tend to hold more objects from the most 
frequently visited web pages, making the cache states 
highly specific to the client. Clearly, using our simu- 
lated caching data alone is not enough to create models 
that are able to detect both frequently and infrequently 
visited sites. To alleviate this shortcoming, we create a 
second set of training data by setting the browser’s cache 
limit to 1.5GB. With such a large browser cache, objects 
should not be evicted from the cache even when we per- 
form our random browsing, thereby allowing us to gain 
information about web browsing behaviors for our tar- 
get sites when they are viewed frequently. The training 
data that we use to create our models now consists of 90 
web browsing sessions of simulated cache data, and 64 
browsing sessions of unlimited cache data for each tar- 
get site. The procedure for building our models remains 
the same, except we now use the flow records from both 
cache scenarios. 


Results To provide a more realistic evaluation of the 
threat our identification techniques pose to anonymized 
NetFlow data, we re-examine its performance on three 
distinct datasets. First, we use the testing data from our 
closed world evaluation to measure the effect that the introduction
of unlimited cache data and web session parsing
have on the performance of our technique. Second and third,
we capture web browsing sessions from different network
providers in Maryland and in Pennsylvania, respectively. By
comparing the performance of our technique on these 
three datasets, we can glean insight into the effects of lo- 
cality on the success of attacks on anonymized NetFlow 
data. 

The effects that the changes to our models have on 
the performance of our technique are shown in Table 2.
Clearly, the false detection rate increases substantially,
but the true detection rate also increases. As in
the closed world scenario, we find that the web pages 
with constantly changing content are more difficult to 
detect than static web pages, and that those sites with 
complex structure (i.e., many logical servers, and many 
flows) achieve a significantly lower false detection rate 
than those sites with simple structure. The substantial 
change in performance can be explained by the relax- 
ation of the BBN constraints to allow for web browser 
session parsing. This relaxation allows any web brows- 
ing session where a subset of physical servers meets the 
remaining constraints to be identified, thereby causing 
the increase in both true detection and false detection 
rate. A more detailed analysis of the implications of 
these effects is provided in §7. 


It is often the case that published network data is taken 
at locations where an attacker would not have access to 
the network to collect training data for her models, and 
so we investigate the effect that the change in locality
has on the performance of our technique. The results 
in Table 2 show that there is, indeed, a drop in perfor- 
mance due to changes in locality, though trends in true 
detection and false detection rates still hold. In our eval- 
uation, we noticed that the Johns Hopkins data used to 
train our web page models included a web caching server 
that caused significant changes in the download behavior 
of certain web pages. These changes in behavior in turn 
explain the significant difference in performance among 
data collected at different localities. It would appear that 
these results are somewhat disconcerting for a would-be 
attacker, since she would have to generate training data at 
a network that was different from where the anonymized 
data was captured. However, she could make her training 
more robust by generating data on a number of networks, 
perhaps utilizing infrastructure such as PlanetLab [25], 
though the effects of doing so on the performance of the 
technique are unknown. By including web page down- 
load behavior from a number of networks, she can en- 
sure that the KDE and BBN models for each target web 
page are robust enough to handle a variety of network 
infrastructures. 


7 A Realistic Threat Assessment 


Finally, we provide a threat assessment by applying our 
technique to live data collected from a public wireless 
network at Johns Hopkins University Security Institute 
over the course of 7 days. From this data, we examine 
the expected real world accuracy of our techniques and 
discuss the features that make some web pages prime tar- 
gets for identification. 

[Table 2 body not recoverable; the only legible row reads: 7558 | 1328 | 2686 2439]

Table 2: True and false detection rates for canonical categories in JHU data, and comparison to remote datasets


Web Page        TD       FD
facebook.com    0.00     0.07
amazon.com      0.00     0.78
usps.com        0.00     4.06

[Table 3 is only partly legible: the rows for digg.com, washingtonpost.com, cnn.com, and yahoo.com, one unreadable entry, and the Overall row could not be recovered. usps.com is listed under the Corporate category.]

Table 3: True and false detection rates for web pages in live network data


Results The results of our experiment on live data, 
shown in Table 3, provide some interesting insight
into the practicality of identifying web pages in real 
anonymized traffic. In our results from local testing data 
collected via automated browsing, we observe that cer- 
tain categories made up mostly of simple, static web 
pages (e.g., search engines) provide excellent true detec- 
tion rates, while web pages whose content changes often 
(e.g., news web pages) perform significantly worse. Fur- 
thermore, categories of sites with complex structure (i.e., 
many logical servers) generally have exceptionally low 
false detection rates, while categories of simple sites with 
fewer logical servers produce extremely high false detec- 
tion rates. Upon closer examination, not all web pages 
within a given category perform similarly despite hav- 
ing similar content and function. For instance, cnn.com 
and nytimes.com have wildly different performance in 
our live network test despite the fact that both pages have
rapidly changing news content.


To better understand this difference in our classifier’s 
performance for different web pages, we examine three 
features of the page loads in our automated browsing ses- 
sions for each site: the number of flows per web brows- 
ing session, number of physical servers per web brows- 
ing session, and the number of bytes per flow. Our goals 
lie in understanding (i) why two sites within the same
category have wildly different performance, and (ii) why
simple web pages introduce so many more false detections
than more complex web pages. To quantify the
complexity of the web page and the amount of varia- 
tion exhibited in these features, we compute the mean 
and standard deviation for each feature across all obser- 
vations of the web page in our training data, including 
both simulated cache and unlimited cache scenarios. The 
mean values for each feature provide an idea of the com- 
plexity of the web page. For instance, a small average 
number of physical servers would indicate that the web
page does not make extensive use of content delivery net- 
works. The standard deviations tell us how consistent the 
structure of the page is. These statistics offer a concise 
measure of the complexity of each web page, enabling 
us to objectively compare sites in order to determine why 
some are more identifiable than others. 
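
Computing these statistics is straightforward; a minimal sketch follows. The feature names and the sample values in the usage note are made up for illustration, and the sketch uses the sample standard deviation, which is an assumption since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize(sessions):
    """Return {feature: (mean, standard deviation)} across all sessions.

    `sessions` is assumed to be a list of dicts that share the same keys,
    one dict per observed web browsing session of a single page.
    """
    features = sessions[0].keys()
    return {
        f: (mean(s[f] for s in sessions), stdev(s[f] for s in sessions))
        for f in features
    }
```

For instance, `summarize([{"flows": 2, "servers": 1}, {"flows": 4, "servers": 1}])` reports a mean of 3 flows with a non-zero deviation, and exactly one server with zero deviation, the kind of profile that signals a simple and highly consistent page.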


Returning to the difference in true detection rates for 
cnn.com and nytimes.com, the front pages of both sites 
have constantly changing news items, a significant num- 
ber of advertisements, and extensive content delivery 
networks. One would expect that, since the web pages
change so rapidly, our accuracy in classifying them
would be comparable. Instead, we find that we can detect
nearly half of the occurrences of cnn.com in live
network data, while we never successfully detect ny- 
times.com. In Table 4, we see that both web pages have 
similar means and standard deviations for both flow size 
and number of physical servers. This similarity is likely 
due to the nature of the content these two sites provide. 
However, the number of flows per web browsing ses- 
sion for nytimes.com is nearly double that of cnn.com. 
Moreover, nytimes.com exhibits high variability in the 
number of flows it generates, while cnn.com seems to 
use a fairly stable number of flows in all web brows- 
ing sessions. This variability makes it difficult to con- 
struct high-quality kernel density estimates for the logi- 
cal servers that support nytimes.com, so our detector nec- 
essarily performs poorly on it. 


Another interesting result from our live network eval- 
uation is that some web page models appear to match 
well with almost all other pages, and therefore cause an 
excessive number of false detections. For instance, Table
3 shows that yahoo.com has an exceptionally low false 
detection rate among all web pages in our live traffic, 
while google.com has one of the highest false detection 
rates. Both web pages, however, provide adequate true 
detection rates. In Table 5, we see that google.com and 
yahoo.com have very distinct behaviors for each of the 
features. 

The metrics for our three features show that 
google.com transfers very little data, that there is almost 
always only one physical server in the web browsing ses- 
sion, and that there are normally only one or two flows. 
On the other hand, yahoo.com serves significantly more 
data, has a substantial number of physical servers, and 
causes the browser to open several flows per web brows- 
ing session. The web browsing sessions for google.com 
and yahoo.com both exhibit very little variability, though 
yahoo.com has more variability in its flow sizes due to 
its dynamic nature. Since both web pages have relatively 
low variability for all three features, they are both fairly 
easy for our techniques to detect, which corroborates 
our earlier claim that cnn.com is easy to detect because
of the relative stability of its features. However, since
google.com is so simplistic, with only a single physical 
server and very few flows on average, its BBN and KDE 
models have very few constraints that must be met before 
the detector flags a match. Hence, many physical servers 
in a given NetFlow log could easily satisfy these requirements,
and this causes the detector to produce an excessive
number of false detections. By contrast, the models
for yahoo.com have enough different logical servers and 
enough flows per session that it is difficult for any other 
site to fit the full description that is captured in the BBN 
and KDE models. 


Discussion With regard to realistic threats to 
anonymized network data, these results show that 
there are certain web pages whose behavior is so 
unpredictable that they may be very difficult to detect in 
practice. Also, an attacker has little chance of accurately 
identifying small, simple web pages with our techniques. 
Complex web pages containing large content delivery 
networks, on the other hand, may allow an attacker 
to identify these pages within anonymized flow traces 
with low false detection rates. Finally, we have found 
that an attacker must consider the effects of locality 
on the training data used to create the target web page 
models, such as the presence of private caching servers 
or proxies. These locality effects adversely influence the 
true detection rates, but they might be overcome through 
diversification of the training data from several distinct 
locations. It is unclear how this diversification would 
affect the performance of our techniques. 

When evaluating the threat that our web page identifi- 
cation attack poses to privacy, it is prudent to consider the 
information an attacker can reliably gain, possible practi- 
cal countermeasures that might hamper such attacks, and 
the overarching goals of network data anonymization. 
With the techniques presented in this paper, an attacker 
would be able to create profiles for specific web pages of 
interest, and determine whether or not at least one user 
has visited that page, as long as those target web pages 
were of sufficient complexity. While the attacker will 
not be able to pinpoint which specific user browsed to 
the page in question with the technique presented in this 
paper, such information leakage may still be worrisome 
to some data publishers (e.g., web browsing to several 
risqué web pages). 

There are, however, practical concerns that may affect
the attacker’s success aside from those described in
this paper, such as the use of ad blocking software and
web accelerators that dramatically alter the profiles of
web pages. These web browsing tools could be used to
make the attacker’s job of building robust profiles more
difficult, as the attacker would not only have to adjust
for locality effects, but also for the effects of the particular
ad blocking software or web accelerators. Moreover,
while our evaluation has provided evidence that certain
classes of web pages are identifiable despite the use of
anonymization techniques, it is unclear how well the true
detection and false detection rates scale with a larger target
web page set. Therefore, our techniques appear to be
of practical concern insofar as the attacker can approximate
the behavior of the browsers and network environment
used to download the web page.


                             | cnn.com             | nytimes.com
                             | Mean      Std. Dev. | Mean      Std. Dev.
Number of Flows              | 18.44     4.21      | 30.69     10.62
Number of Physical Servers   | 12.79     2.27      | 15.32     4.14
Flow Size (KB)               | 568.20    286.95    | 692.87    298.73

Table 4: Comparison of mean and std. deviation for features of cnn.com and nytimes.com


                             | google.com          | yahoo.com
                             | Mean      Std. Dev. | Mean      Std. Dev.
Number of Flows              | 1.73      0.56      | 9.02      3.02
Number of Physical Servers   | 1.03      0.17      | 5.25      1.79
Flow Size (KB)               | 13.64     10.37     | 219.51    187.26

Table 5: Comparison of mean and std. deviation for features of google.com and yahoo.com


8 Conclusion 


In this paper, we perform an in-depth analysis of the 
threats that publishing anonymized NetFlow traces poses 
to the privacy of web browsing behaviors. Moreover, we 
believe our analysis is the first that addresses a number 
of challenges to uncovering browsing behavior present 
in real network traffic. These challenges include the ef- 
fects of network locality on the adversary’s ability to 
build profiles of client browsing behavior; difficulties 
in unambiguously parsing traffic to identify the flows 
that constitute a web page retrieval; and the effects of 
browser caching, content distribution networks, dynamic 
web pages, and HTTP pipelining. To accommodate
these issues, we adapt machine learning techniques
to our problem in novel ways.

With regard to realistic threats to anonymized NetFlow 
data, our results show that there are certain web pages 
whose behavior is so variable that they may be very dif- 
ficult to detect in practice. Also, our techniques offer an 
attacker little hope of accurately identifying small, sim- 
ple web pages with a low false detection rate. However, 
complex web pages appear to pose a threat to privacy. 
Finally, our results show that an attacker must consider 
the effects of locality on the training data used to create 
the target web page models. 

Our results and analysis seem to contradict the widely
held belief that small, static web pages are the easiest
target for identification. This contradiction can be ex- 
plained by the distinct differences between closed world 
testing and the realities of identifying web pages in the 
wild, such as browser caching behavior and web brows- 
ing session parsing. On the whole, we believe our study 
shows that a non-trivial amount of information about 
web browsing behaviors is leaked in anonymized net- 
work data. Indeed, our analysis has demonstrated that 
anonymization offers less privacy to web browsing traf- 
fic than once thought, and suggests that a class of web 
pages can be detected in a flow trace by a determined 
attacker. 


Notes


1. Though machine learning techniques are certainly not the only
method for handling variability in web pages, their application in this
context seems to be intuitive.

2. Note that even if this assumption did not hold, there are still techniques
that can be used to infer the presence of HTTP traffic (e.g., based
on traffic-mix characteristics).